Unsupervised and knowledge-poor approaches to sentiment analysis
Abstract
Sentiment analysis focuses upon automatic classification of a document's sentiment (and more generally extraction of opinion from text). Ways of expressing sentiment have been shown to be dependent on what a document is about (domain-dependency). This complicates supervised methods for sentiment analysis which rely on extensive use of training data or linguistic resources that are usually either domain-specific or generic. Both kinds of resources prevent classifiers from performing well across a range of domains, as this requires appropriate in-domain (domain-specific) data. This thesis presents a novel unsupervised, knowledge-poor approach to sentiment analysis aimed at creating a domain-independent and multilingual sentiment analysis system. The approach extracts domain-specific resources from documents that are to be processed, and uses them for sentiment analysis. This approach does not require any training corpora, large sets of rules or generic sentiment lexicons, which makes it domain- and language-independent but at the same time able to utilise domain- and language-specific information. The thesis describes and tests the approach, which is applied to different data, including customer reviews of various types of products, reviews of films and books, and news items; and to four languages: Chinese, English, Russian and Japanese. The approach is applied not only to binary sentiment classification, but also to three-way sentiment classification (positive, negative and neutral), subjectivity classification of documents and sentences, and to the extraction of opinion holders and opinion targets. Experimental results suggest that the approach is often a viable alternative to supervised systems, especially when applied to large document collections.
Taras Zagibalov
Submitted for the degree of Doctor of Philosophy
University of Sussex
September 2010
Declaration
I hereby declare that this thesis has not been and will not be submitted, in whole or in
part, to another University for the award of any other degree.
Signature:.............................................
Taras Zagibalov
Acknowledgements
I owe my deepest gratitude to my academic supervisor John Carroll for valuable advice
and friendly guidance, for encouragement and support. I am also grateful to Bill Keller,
my second supervisor, and David Weir, my Thesis committee member, for their guidance
and suggestions.
I am indebted to my colleagues for their support, especially to Jonathon Read, who
was always ready to help and advise me. I would like to deeply thank my friend Martine
Self and her family for their help and friendship.
I am grateful to the Ford Foundation Fellowship Program, which sponsored my research
and stay in the UK.
I owe a lot to my parents, Maria and Evgenij, for everything they have done for me,
for all their love and care.
This thesis would not have been possible without the love, support and patience of my
beloved wife Olesya. Thank you, my dear!
List of Tables
List of Figures
1 Introduction
1.1 Background
1.2 Research Overview
1.2.1 The Scientific Question
1.2.2 Hypotheses
1.2.3 Contributions of this Work
1.3 Outline of the Thesis
1.3.1 Approach and Methodology
1.3.2 Structure of the Thesis
2 Literature Review
2.1 Study of Affect
2.1.1 Private States
2.1.2 Categorical and Dimensional Paradigms
2.1.3 Affect Across Cultures
2.2 Sentiment Analysis
2.2.1 Tasks
2.2.2 Techniques
2.2.3 Features
2.2.4 Levels
2.2.5 Text Types and Domains
2.3 Resource Development
2.4 Challenges of Sentiment Analysis
2.4.1 Cross-Domain Approaches
2.4.2 Cross-Language Approaches
3 Features for Chinese Sentiment Classification
3.1 The `Word' in Chinese Language Processing
3.1.1 Preliminary Word Segmentation of Chinese Texts
3.1.2 Preliminary Segmentation Experiment
3.2 Words and Characters as Features for Sentiment Classification
3.2.1 Basic Concepts
3.2.2 Experimental Data and Classification Algorithm
3.2.3 Evaluation Metrics and Statistical Significance Test
3.3 Experiments with Classification Units
3.3.1 Unigram-Based Classification
3.3.2 Zone-Based Classification
3.3.3 Sentence-Based Classification
3.3.4 Discussion
3.4 Sentiment Score Extensions
3.4.1 Negation Check
3.4.2 Length Ratio
3.4.3 Discussion
3.5 Summary
3.5.1 Accuracy
3.5.2 Coverage
3.5.3 Skew
3.5.4 Precision
3.5.5 Conclusion
4 Classifier Improvements and Extensions
4.1 Dictionary Adjustment
4.1.1 Adjustment to Corpus
4.1.2 Adjustment to Topic
4.1.3 Discussion
4.2 Vocabulary Extraction
4.2.1 Seed-Based Approach
4.2.2 Automatic Seed Word Selection
4.2.3 Iterative Approach
4.2.4 Discussion
4.3 Performance Improvements
4.3.1 Score Difference
4.3.2 Zone Difference
4.3.3 Using Supervised Techniques to Extend Unsupervised Classifier
4.3.4 Comparison of Supervised and Unsupervised Classifiers
4.4 Discussion
4.5 Conclusion
5 Multilingual Sentiment Classification
5.1 Data
5.1.1 Language-Specific Issues
5.1.2 Book Review Corpora
5.1.3 Issues that may Affect Automatic Processing
5.1.4 Movie Review Corpus
5.2 Supervised Classification Experiments
5.2.1 Lexical Unit Extraction
5.2.2 Experimental Results
5.3 Unsupervised Classification Experiments
5.3.1 Seed-Based Classification
5.3.2 Classification Results
5.3.3 Score Difference
5.3.4 Zone Difference for Result Ranking
5.3.5 Combining with Supervised Machine Learning Techniques
5.4 Discussion
6 Multi-Aspect Sentiment Analysis
6.1 Three-Way Classification
6.1.1 Sentiment Classification
6.1.2 Subjectivity Classification
6.2 Sentence-Level Subjectivity and Sentiment Classification
6.2.1 Data
6.2.2 Classification Using an Existing Classifier
6.2.3 Discussion
6.2.4 Stand-Alone Subjectivity Classification
6.2.5 Evaluation Results
6.2.6 Discussion
6.3 Opinion Holder and Opinion Target Extraction
6.3.1 Overview of the Approach
6.3.2 Language-Specific Adjustment
6.3.3 System Summary
6.3.4 Experiments
6.3.5 Discussion
6.4 Conclusion
7 Conclusion
7.1 Unsupervised Sentiment Classification
7.2 Other Tasks
7.3 Cross-Domain Sentiment Classification
7.4 Multilingual Sentiment Classification
7.5 Future Work
7.6 Practical Implementation
Bibliography
List of Tables
3.1 Results of sentiment classification of product reviews from the web-site IT168, with and without segmentation
3.2 Results of unigram-based sentiment classification using different types of features
3.3 Results of sentiment classification with the characters present only in a single class
3.4 Results of zone-based sentiment classification
3.5 Results of sentence-based sentiment classification
3.6 Precision of the unigram, zone-based and sentence-based sentiment classifiers
3.7 Results of unigram-based sentiment classification with negation
3.8 Results of zone-based sentiment classification with negation
3.9 Results of sentence-based sentiment classification with negation
3.10 Results of unigram-based sentiment classification with length ratio
3.11 Results of zone-based sentiment classification with length ratio
3.12 Results of sentence-based sentiment classification with length ratio
3.13 Results of unigram-based sentiment classification with length ratio and negation check combined
3.14 Results of word-based sentiment with different features
4.1 List of top 10 words
4.2 Results of word-based sentiment classification before and after feature adjustment
4.3 Results of combined classifier sentiment classification before and after feature adjustment
4.4 Average of the results of five runs on a test corpus of the word classifier sentiment classification before and after feature adjustment
4.5 Product types and sizes of the test corpora.
List of Figures
4.1 Classification results with the seed list all with the score difference technique.
4.2 Classification results with the seed list all and with the zone difference technique.
4.3 Classification results with the seed list all and the zone distance technique (Topics).
4.4 Classification results with extracted seeds and the zone distance technique (Topics).
4.5 Information retrieval simulation results with the seed list all and the zone distance technique.
4.6 Information retrieval simulation results with extracted seeds and the zone distance technique.
5.1 Distribution of documents by the number of words contained.
5.2 Information retrieval simulation results with the zone distance technique.
5.3 Score difference results for the movie review corpus.
6.1 The distribution of Chinese customer reviews with respect to Sentiment Score and Sentiment Density.
6.2 The distribution of factual documents with respect to Sentiment Score and Sentiment Density.
6.3 The distribution of factual documents with respect to Sentiment Score and Sentiment Density with the NTU Sentiment Dictionary.
Introduction
1.1 Background
This thesis is about the automated analysis of sentiment in written language. Sentiment
analysis is concerned not with the topic or factual content of a document, but rather with
the opinion expressed in it. Sentiment analysis has often been broken down into a set of
subtasks, including subjectivity classification, opinion classification (sentiment
classification), opinion holder and opinion target extraction, and feature-based opinion
mining.
Opinion classification is usually framed as a two-way classification of positive and
negative sentiment, and has been applied at different levels: phrases, sentences, documents
and collections of documents. An opinion may have a holder (a person or a group that
expresses an opinion) and a target (an object which is being discussed or evaluated).
Feature-based opinion mining tries to find opinions about particular features of a product
or service (as opposed to an overall opinion about something).
Automatic classification of document sentiment (and more generally extraction of
opinion from text) has recently attracted much interest. One of the main reasons for this is
the importance of such information to companies, other organizations, and individuals.
Applications include marketing research tools that help a company see market or media
reaction towards their brands, products or services. Another type of application is search
engines that help potential purchasers make an informed choice of a product they want
to buy. Such search engines include a sentiment classification subsystem that may not
only present to a customer overall sentiment about a product, but also select positive or
negative reviews to illustrate advantages and shortcomings of a product.
Automated sentiment analysis provides a range of possibilities for researchers in the
humanities whose studies involve analysis of large amounts of human-generated data. For
[...] events are shared in mainstream media and in social media. Analysis of user-generated
content may be very helpful in political studies. For example, monitoring of political
debates in social media may help to estimate the prospects of political candidates in
elections or evaluate the effectiveness of political campaigns. The study of "the language
of hatred" contributes to efforts against political and religious extremism and intolerance.
Many aspects of social studies may benefit from automatic analysis of sentiments expressed
by people in ever-growing social networks. This approach offers unintrusive and fast access
to large amounts of data.
In a recent white paper addressing the role of sentiment analysis in organisations, Grimes
(2010) noted that "one axiom of full-circle sentiment analysis is ability to use all relevant
sentiment sources". This obviously includes resources in different languages, of different
genres and written in different styles. The most widely used approach to opinion and
subjectivity classification is based on supervised machine learning, in which a system
learns from human-annotated training data how to classify documents. However, a major
obstacle for automatic classification of sentiment and subjectivity is often a lack of training
data, which limits the applicability of approaches based on supervised machine learning.
With the rapid growth in the amount of textual data and the emergence of new domains of
knowledge it is virtually impossible to maintain corpora of annotated data that cover all,
or even most, areas of interest. The cost of manual annotation also adds to the problem.
Re-using the same corpus for training classifiers for new domains is also not effective:
several studies report decreased accuracy in cross-domain classification (Engstrom, 2004;
Read, 2005; Aue and Gamon, 2005). Indeed, a classifier trained in a film review domain
might consider the word unpredictable (e.g. unpredictable plot) to express a positive
characteristic. However, the same word in a car review might be a marker of negative
sentiment (e.g. unpredictable steering) (Turney, 2002). A similar problem has also been
observed in classification of documents created over different time periods (Read, 2005).
Some words were found to express a certain sentiment only for a definite period of time.
The word ice-axe, for example, was a strong indicator of positive sentiment because it was
frequently used in mostly positive reviews of a film that featured a particularly stirring
scene involving this tool.
Rule-based and dictionary-based classification have similar limitations, as they
also rely on a large set of manually created resources used for classification.
A major current challenge, therefore, is to be able to automatically extract sentiment
Most existing solutions are based on adapting systems designed for one language (or
domain) to another. Obviously, there are differences between cultures, languages and
even within a language (consider the difference in the language used for evaluations of a
company's financial prospects in a business newspaper and reviews of a hard-rock festival
in a participant's blog). Such differences make adaptation problematic. Porting a sentiment
analysis system to new languages is even more difficult.
This thesis proposes an approach based on the idea of finding all data needed for
classification within the documents to be classified. Domain-specific data is often hard to
find, and generic resources, such as sentiment lexicons, often fail to include
all relevant markers of opinion. Even well-known and `obvious' markers of sentiment may
demonstrate a sharp twist in their meaning in certain domains. For example, Ghose et al.
(2007) found that the word good is an indicator of negative sentiment in the domain of
eBay customer reviews: to describe something really good customers tend to use perfect
and excellent, reserving good for polite expression of negative appraisal (as in the package
is good (but might have been better)).
To overcome this problem the approach investigated in this thesis is to bootstrap
sentiment-related data from documents using a very limited number of seed lexical units.
This approach is used across domains, as well as across languages.
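The bootstrapping idea can be pictured with a short Python sketch. This is only one plausible reading of the sentence above, not the procedure developed later in the thesis: the whitespace tokenisation, the document-frequency ratio test, the add-one smoothing, the toy documents and the names classify and bootstrap are all assumptions introduced for illustration.

```python
from collections import Counter


def classify(doc, pos_lex, neg_lex):
    """Return +1, -1 or 0 by counting lexicon hits in a whitespace-tokenised document."""
    tokens = doc.split()
    score = sum(t in pos_lex for t in tokens) - sum(t in neg_lex for t in tokens)
    return (score > 0) - (score < 0)


def bootstrap(docs, pos_seeds, neg_seeds, rounds=3, min_ratio=2.0):
    """Grow two lexicons from seed units using only the documents themselves.

    Hypothetical rule: a unit joins a lexicon when it occurs in at least
    min_ratio times as many documents of that polarity as of the opposite
    polarity (with add-one smoothing on the opposite count).
    """
    pos_lex, neg_lex = set(pos_seeds), set(neg_seeds)
    for _ in range(rounds):
        pos_counts, neg_counts = Counter(), Counter()
        for doc in docs:
            label = classify(doc, pos_lex, neg_lex)
            units = set(doc.split())  # document frequency, not term frequency
            if label > 0:
                pos_counts.update(units)
            elif label < 0:
                neg_counts.update(units)
        new_pos = {u for u, c in pos_counts.items()
                   if c >= min_ratio * (neg_counts[u] + 1)}
        new_neg = {u for u, c in neg_counts.items()
                   if c >= min_ratio * (pos_counts[u] + 1)}
        # Never move a unit that already belongs to the opposite lexicon.
        pos_lex |= new_pos - neg_lex
        neg_lex |= new_neg - pos_lex
    return pos_lex, neg_lex


# Invented toy documents; "good" and "bad" are the only seeds.
docs = [
    "good sturdy case",
    "good sturdy strap",
    "bad flimsy case",
    "bad flimsy strap",
    "sturdy build",
    "flimsy build",
]
pos_lex, neg_lex = bootstrap(docs, {"good"}, {"bad"})
```

In this toy run the units sturdy and flimsy are picked up from the documents that the seeds label, which in turn allows the two documents containing no seed at all to be classified in the next round.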
1.2 Research Overview
1.2.1 The Scientific Question
The main goal of the research presented in this thesis is to investigate the extent to which it
is possible to build an unsupervised domain-independent cross-lingual sentiment analysis
system. Such a system could be of great utility due to the ever-growing amount of all
kinds of unstructured information in different languages which often contains opinions and
evaluations.
1.2.2 Hypotheses
The research explores five main hypotheses:
Hypothesis 1: Unsupervised systems can be developed for performing sentiment
analysis in different domains and in different languages that perform comparably
with supervised systems.
Hypothesis 2: [...] language-specific input. Such a system might require only a basic
indication of what positive and negative sentiments are, in the form of lexical `seeds'.
Hypothesis 3: A sentiment-related vocabulary automatically extracted from a corpus
can produce similar or better results compared to a specialised hand-built sentiment
vocabulary.
Hypothesis 4: An automatically acquired training corpus in conjunction with machine
learning techniques can produce sentiment classification results similar or close
to a standard supervised approach.
Hypothesis 5: A uniform notion of `lexical unit' can be used across languages for
sentiment analysis tasks.
1.2.3 Contributions of this Work
This thesis presents a number of novel and significant contributions to research in
sentiment analysis:
1. An unsupervised knowledge-poor approach to domain-independent sentiment
analysis
2. Use of the approach as a means of multilingual sentiment analysis
3. Sentiment zones (sequences of characters between punctuation marks) as units of
classification
4. Sentiment score (a score based on the relative frequencies of units in documents of
opposite sentiment) as a technique for sentiment classification
5. Score-difference technique for filtering out noise in sentiment classification. The
technique is based on calculating the difference between opposite sentiment scores
of an item.
6. Zone-difference technique for ranking sentiment classification. Zone-difference is a
difference of zones of opposite sentiment in a document.
7. An unsupervised opinion holder and opinion target extraction technique
8. A scale-based sentiment classification, as an alternative to a traditional binary
classification
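Items 3 to 6 of this list can be made concrete with a minimal Python sketch. It is not the implementation used in the thesis: the punctuation set, whitespace-separated units, the exact form of the scores, the threshold and the function names (split_zones, sentiment_scores, filter_by_score_difference, zone_difference) are assumptions chosen for illustration, and the documents are invented.

```python
import re
from collections import Counter

# Assumed punctuation set (ASCII plus common Chinese marks).
PUNCT = r"[,.;:!?，。；：！？、]+"


def split_zones(text):
    """Split text into zones: character sequences between punctuation marks."""
    return [z.strip() for z in re.split(PUNCT, text) if z.strip()]


def sentiment_scores(pos_docs, neg_docs):
    """Map each unit to (positive score, negative score).

    Each score is the unit's relative frequency in documents of one
    sentiment; this particular formula is an illustrative assumption.
    """
    pos = Counter(u for d in pos_docs for u in d.split())
    neg = Counter(u for d in neg_docs for u in d.split())
    n_pos = max(sum(pos.values()), 1)  # guard against empty input
    n_neg = max(sum(neg.values()), 1)
    return {u: (pos[u] / n_pos, neg[u] / n_neg) for u in set(pos) | set(neg)}


def filter_by_score_difference(scores, threshold):
    """Drop noisy units whose opposite sentiment scores are too close."""
    return {u: s for u, s in scores.items() if abs(s[0] - s[1]) > threshold}


def zone_difference(text, scores):
    """Number of positive zones minus number of negative zones in a document."""
    diff = 0
    for zone in split_zones(text):
        units = zone.split()
        pos = sum(scores.get(u, (0.0, 0.0))[0] for u in units)
        neg = sum(scores.get(u, (0.0, 0.0))[1] for u in units)
        if pos > neg:
            diff += 1
        elif neg > pos:
            diff -= 1
    return diff


# Invented toy data.
pos_docs = ["great phone great screen", "good battery"]
neg_docs = ["bad screen", "poor battery bad sound"]

scores = sentiment_scores(pos_docs, neg_docs)
filtered = filter_by_score_difference(scores, threshold=0.1)
zones = split_zones("great screen, bad sound. good battery!")
diff = zone_difference("great screen, bad sound. good battery!", filtered)
```

In this toy data screen and battery occur equally often in both classes, so the score-difference filter removes them, while the remaining units decide the polarity of each zone; a document could then be ranked by the resulting zone difference.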
Literature Review
This chapter presents an overview of approaches to sentiment analysis and the various
research paradigms used. Section 2.1 describes research in `affect' which sets the background
for sentiment analysis as part of NLP. The following section (2.2) describes different aspects
of sentiment analysis, covering its main tasks, as well as different types of features and
techniques used in this research field; the section also surveys domains where sentiment
analysis is used. Approaches to resource development are discussed in Section 2.3. Section
2.4 discusses the most significant outstanding challenges in sentiment analysis.
2.1 Study of Affect
This section discusses the theoretical background of sentiment analysis, touching on rel-
evant work in linguistics, psychology and ethnography as these areas provide important
foundations for cross-lingual sentiment analysis.
2.1.1 Private States
The linguistic concept of non-factual information expressed in a text is relatively young.
Quirk et al. (1985) introduced the linguistic term private state that denotes mental or
emotional states, hidden from objective observation. Banfield (1982) proposed a term
for the linguistic expression of private states: subjectivity. Thus subjectivity analysis is
aimed at identification of attributes of private states: the subject who expresses a private
state, the object about whom the state is expressed, the type of the attitude, the intensity
of the private state etc. In this sense, subjectivity analysis and sentiment analysis are often
used interchangeably. Pang and Lee (2008) give a different, narrower, NLP-specific
definition of subjectivity analysis as classifying a given text (a document or a sentence)
into one or more private states.
2.1.2 Categorical and Dimensional Paradigms
Most research in sentiment analysis is based on one of two basic approaches: categorical
and dimensional. The first approach puts all emotions into a finite number of categories
(e.g. anger, fear, sadness, surprise), while the other one delineates emotions according to
multiple dimensions rather than discrete categories.
The categorical approach is represented by the Cognitive Structure of Emotions (Ortony
et al., 1988) which provides a taxonomy of emotions based on the different conditions
that cause them. But since this approach is based on psychological contexts (for example,
relations between people) which usually are not represented in the text, it is quite difficult
to base any NLP study on it.
Another theory within the categorical paradigm that is derived from psychology is
Appraisal Theory. It claims that all emotions are the result of evaluations (appraisals) of
events that cause specific reactions in different people (Scherer and Schorr, 2001).
Appraisal Theory is applied to language by Systemic Functional Linguistics as a theory of
evaluation in text. Appraisal Theory analyses the way opinion is expressed in text and
provides taxonomies for systematic identification of expressions of opinions and emotions
in context. The taxonomies not only include words related to certain emotions or opinions
but also cover the way authors interact with other authors and their audience.
According to Appraisal Theory, appraisal consists of three subsystems that function
interactively: attitude, engagement and graduation. Attitude addresses one's feelings
(emotional reactions, judgements of people and appreciations of objects); Engagement is
concerned with the positioning of oneself with respect to the opinions of others and with
respect to one's own opinions; Graduation considers the ways a language increases or
decreases the attitude and engagement in a text. Since this theory describes linguistic
means of expression of emotions (lists of words that convey appraisal, for example) it can
immediately be applied to NLP studies (for example, Read and Carroll, 2009).
Another way of representing affect is to put it into a multi-dimensional semantic space.
For example, a two-factor structure of affect (described by Watson and Tellegen, 1985)
puts emotion in two dimensions: Pleasantness (from happy to sad) and Engagement (from
surprised to quiet).
Osgood et al. (1971) delineate emotions according to multiple dimensions: the two
[...] or evaluation) and a `strong-weak' axis (the dimension of activation or intensity).
The dimensional understanding of affect is very productive for NLP as a basis for
sentiment classification studies that also use (a very simplified) scale of sentiments ranging
from two-point (positive-negative) to multi-point classifications (the `five-star' system of
Pang and Lee, 2005).
2.1.3 Affect Across Cultures
Since the research presented in this thesis addresses sentiment analysis in a multilingual
context, the cross-cultural aspects of affect are also very relevant. Important questions
include: Is sentiment universal? Is it expressed in comparable ways and can a unified
approach be adopted? Is such an approach potentially applicable to other languages not
tested in this research?
Ekman and Friesen (1971) found that particular facial behaviours are universally
associated with particular emotions regardless of ethnic or cultural background. The existence
of cross-cultural constants in emotional behaviour suggests that similar constants may be
found in language. This was studied by Osgood et al. (1975) in 20 different countries with
the help of about 80 anthropologists, psychologists and linguists. The study was done in
the paradigm of semantic space measurement (Osgood et al., 1971; Osgood, 1976). The
authors' general objective was to demonstrate that three affective dimensions of meaning,
Evaluation, Potency, and Activity (E-P-A), are in fact pancultural. In particular, they
found that the two most common modes of affect qualification across the
world are GOOD and BIG (or some close synonym). They ranked the qualifiers found in
each ethno-linguistic community in terms of both frequency and diversity of usage (i.e.
productivity) and then correlated rankings in terms of translation equivalents, and found
sizable and significant relationships. Osgood et al. (1975) concluded that "Human beings,
no matter where they live or what language they speak, apparently abstract about the
same properties of things for making comparisons, and they order these different modes
of qualifying in roughly the same way in importance".
These findings suggest that a unified approach to sentiment analysis across multiple
languages is in principle well-founded, providing a solid basis for the work presented in
this thesis.
2.2 Sentiment Analysis
Sentiment analysis has been a popular research topic in recent years and has evolved
into a large and diverse research field. A number of approaches have been used to create
new research prototypes and applied sentiment analysis systems. This section surveys the
various tasks in sentiment analysis and the methods used to perform them.
2.2.1 Tasks
There are four main tasks that are tackled in present-day sentiment analysis research:
subjectivity analysis, sentiment classification, opinion summarisation, and opinion extraction
and mining.
Subjectivity Analysis
Subjectivity analysis, as indicated in Chapter 1, aims to distinguish subjective text
(documents, sentences) from factual text. Subjective texts are those that express private states,
which distinguishes them from objective (factual) texts, which express only objective
information, or facts.
Subjectivity analysis is a difficult task. The difficulty is mostly caused by the nature of
the private states that subjectivity analysis deals with. The subjective or objective nature of
text is hardly ever stated explicitly (Wiebe, 1994), which complicates automatic processing
of information that contains private states. Another challenging aspect of subjectivity
analysis is that documents are almost never entirely either objective or subjective. Even
a single sentence may contain factual information and some subjective evaluation of it.
However, a number of studies demonstrate reasonable success in subjectivity analysis.
A widely used technique in NLP, supervised machine learning, is often applied to
subjectivity classification. Yu and Hatzivassiloglou (2003) describe document-level
classification of news items using a Naïve Bayes classifier. Their research also investigated three
approaches to identifying subjective sentences. The first was based on a hypothesis that,
within a given topic, opinion sentences will be more similar to other opinion sentences than
to factual sentences. The second used a Naïve Bayes classifier trained on documents that
were supposed to be subjective (e.g. editorials). The features included words, bigrams,
and trigrams, as well as the parts of speech in each sentence. Thirdly, the authors applied
an algorithm using multiple classifiers, each relying on a different subset of the features.
The study found that the Naïve Bayes classifier proved to be the most effective tool for
sentiment classification, with multiple classifiers slightly increasing performance. Wilson et al.
on a three-way classification (positive, negative and neutral) was proposed by Koppel and
Schler (2006) who stressed the importance of the neutral class for sentiment classification.
Sentiment and Subjectivity. Pang and Lee (2004) propose a supervised machine-learning
method of determining polarity that applies text-categorization techniques to
subjective portions of a document only. These portions are extracted using minimum cuts
in graphs. The idea of minimum cuts is inspired by the observation that text spans
occurring near each other (within discourse boundaries) may share the same subjectivity status,
other things being equal (Wiebe, 1994). Pang and Lee found that subjectivity detection
can compress reviews into much shorter extracts that still retain polarity information at
a level comparable to that of the full review. Using these extracts for polarity
classification improves accuracy (from 82% for full reviews to 86%), suggesting that
they are not only shorter, but also "cleaner" representations of document polarity.
The role of neutral (objective) text in sentiment classification was studied by Koppel
and Schler (2006). The authors showed that in learning polarity, neutral examples cannot
be ignored. Using only negative and positive training examples does not permit
accurate classification of neutral examples. Moreover, better distinction between positive and
negative examples can be achieved using neutral training examples. Properly combining
pairwise learned classifiers leads to extremely significant improvement in overall
classification accuracy. But the combination of the classifiers depends on the nature of the corpus,
more specifically on the nature of the neutral documents in the corpus – whether they are
truly neutral or in fact balanced (containing both sentiments).
Supervised Sentiment Classification. Sentiment can be expressed in numerous ways
and some studies have investigated what parts of the language are the most important
for detecting sentiments. For example, Alm et al. (2005) ran supervised machine learning
experiments on recognizing emotional passages and determining their valence (i.e. positive
versus negative) in a corpus of children's stories. The authors used a very large feature set
comprising 14 different kinds of features, including word lists, syntactic, story-related and
orthographic features, conjunctions, and content BOW ("bag-of-words") features, some
of which were found automatically and some manually.
Another type of feature was used by Whitelaw et al. (2005b). They used adjectival
appraisal groups as features for supervised sentiment classification of film reviews. The
appraisal groups, coherent groups of words that together express a particular attitude, are
part of a full appraisal expression as defined in Appraisal Theory (Martin and White, 2005).
The list of appraisal groups was produced semi-automatically, and manually modified to
filter out noise. In total, 1329 terms were produced from 400 seed terms.
Other studies have experimented not only with different features but also with various
machine learning classifiers (most notably Support Vector Machines, Naïve Bayes,
and Maximum Entropy) and their combinations. Das and Chen (2007) tried a classifier
voting technique for extracting small investor sentiment (buy, sell or hold) from stock
message boards. Their approach was based on voting amongst five classifiers: a naïve
classifier (simply counting words with positive or negative meaning), a vector distance classifier
(a standard vector-based approach), a discriminant-based classifier (counting discriminant
scores of each word), an adjective-adverb phrase classifier (counting only noun phrases with
adjectives or adverbs) and a Naïve Bayes classifier. The features were a hand-picked
collection of finance domain words. In particular, they observed that the Naïve Bayes classifier
performed quite well, producing fewer false positives.
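The voting idea can be illustrated with a minimal sketch. It assumes a simple majority rule over the labels returned by the individual classifiers; the function name and the choice to return no label when no majority exists are assumptions made for this example and are not taken from Das and Chen (2007).

```python
from collections import Counter


def majority_vote(labels):
    """Return the label chosen by more than half of the classifiers.

    `labels` holds one predicted label per classifier (e.g. "buy",
    "sell" or "hold"). If no label wins a strict majority, None is
    returned (an assumption of this sketch).
    """
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes > len(labels) / 2 else None


# Five hypothetical classifier outputs for one message.
print(majority_vote(["buy", "buy", "hold", "buy", "sell"]))  # buy
```

Requiring a strict majority trades coverage for precision: messages on which the classifiers disagree are left unlabelled rather than guessed.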
Sentiment Classification and Linguistics. A more linguistically driven approach was
investigated by Eriksson (2006), who explored a linguistic method that facilitates sentiment
analysis by using more information from a text than traditional methods based on
machine learning. Eriksson's Linguistic Tree Transformation Algorithm is designed to exploit
the syntactic dependencies between words in a sentence and to disambiguate word senses.
Another technique introduced by Eriksson is an objective sentence removal algorithm.
The approach specifically addresses two major problems in the area of sentiment analysis:
the non-local dependencies problem and the word-sense disambiguation problem. The
Linguistic Tree Transformation Algorithm uses parsing to find all bigrams (mostly
adjective–noun phrases) relevant to the sentiment analysis task, while filtering out all irrelevant
ones. Then an Objective Sentence Removal Algorithm filters out all sentences that do not
contain topic words of interest (such as, for film reviews, the names of the films, directors
and screenwriters or some topic-related nouns). The algorithm is based on the assumption
that some prior knowledge in this domain is readily available for automatic processing.
These two algorithms produce a pruned version of the initial corpus containing only
opinionated sentences relevant to the topic (for example, plot descriptions are removed). 100%
accuracy is reported for the experiments with a frequency SVM model run on the data
produced by the two algorithms.
Linguistically-motivated features help improve existing state-of-the-art sentiment
classification results in a task of detecting implicit sentiment, a novel vision of sentiment
classification proposed by Greene and Resnik (2009). Obviously, implicit sentiment cannot
be detected by traditional indicators, such as words. This led the authors to
investigate the syntactic "packaging" of ideas, studied previously by Greene (2007).
Opinion Summarisation
Opinion summarisation aims to aggregate opinions on a given topic from multiple
documents (probably from different sources) rather than classifying individual documents.
Most approaches start with finding documents relevant to the topic and then classifying
retrieved documents according to their sentiment. The topic might be found
automatically from a set of documents (Hu and Liu, 2004; Chen et al., 2005; Feiguina and Lapalme,
2007) or given as a query (Eguchi and Lavrenko, 2006). The latter approach is close to
opinionated information retrieval as it ranks documents or sentences according to both
topic and sentiment relevance.
Some approaches use a variety of tools for opinion summarisation. In the domain
of film review summarisation, Zhuang et al. (2006) describe a multi-knowledge based
approach that uses WordNet, movie casts and labelled training data (1100 reviews), as
well as grammatical rules linking feature words and opinion words.
Ku et al. (2006b) present a comprehensive system that summarises web blogs on a
given topic (e.g. animal cloning). The summary is then presented by representative
sentences augmented by an opinionated curve showing the degree of support and
non-support along the time-line. The authors use a multi-level (word - sentence - document)
sentiment classification system for detecting opinion direction.
Opinion summarisation can be combined with other techniques to produce an all-round
practical application. Liu et al. (2005) describe a system called Opinion Observer which is
capable of semi-automatic sentiment extraction, sentiment summarisation and visualisation.
The system is able to compare sentiments about different products. The system is based
on supervised rule discovery from a hand-labelled training corpus.
Opinion Extraction and Mining
Opinion extraction and opinion mining (the two terms are commonly used
interchangeably) are concerned with extraction of certain aspects of opinion. One such aspect is the
opinion holder (a person or a group that expresses an opinion) and another is the opinion
target (something which is being discussed or evaluated). Feature-based opinion mining
aims to find opinions about particular features of a product or service (as opposed to an
overall opinion about something).
Opinion Holder Extraction. There are two main types of approach to opinion holder
extraction: one based on machine learning and the other using knowledge-based
techniques. An example of the first type is presented by Kim and Hovy (2006) who used
a machine learning technique for opinion holder extraction. As features for their
Maximum Entropy classifier they used selected structural features from a deep parse, based
on a frame representation of opinionated expressions. The frame was built around an
opinion word, with semantic relations between it and opinion holder and target derived
from semantic role labelling within the frames. Choi et al. (2005) consider opinion holder
extraction to be an information extraction task and use a combination of two techniques:
named entity recognition (by training Conditional Random Fields) and information
extraction (AutoSlog, a supervised extraction pattern learner). The former models source
identification as a sequence-tagging task; the latter learns extraction patterns.
Knowledge-based approaches utilise hand-built lexicons, parsing, heuristics and
ontologies. For example, Bloom et al. (2007) describe an opinion holder extraction approach
based on a hand-built lexicon, a combination of heuristic shallow parsing and dependency
parsing, and expectation-maximization word sense disambiguation; they match phrases in
the text with domain-dependent holder type taxonomies.
Kim et al. (2008) exploited a set of communication and appraisal verbs, SentiWordNet,
a named entity recognizer, and a syntactic parser for opinion holder extraction. In each
sentence they looked for the most opinionated word and then ascended the tree to its
first ancestor node with verbal part of speech, and looked for its subject (a noun phrase)
which was assumed to contain opinion holder candidates. If a subject was not found,
then `author' was set as the opinion holder of the sentence. If a subject was found, then
from the NP chunk, any named entities or opinion holder candidates were extracted as
the opinion holder. If no named entity or opinion holder candidate was found, then the
holder again defaulted to the `author' of the document. Regardless of the previous step, if
a sentence included quotation marks, then the speaker of the quote was extracted as the
opinion holder.
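The fallback logic of these heuristics can be summarised in a short sketch. It covers only the final decision step and assumes that parsing, named entity recognition and quotation detection have already been done elsewhere; the function name, its arguments and the literal `"author"` placeholder are hypothetical and chosen for illustration, not taken from Kim et al. (2008).

```python
def select_opinion_holders(subject_candidates, quote_speaker=None):
    """Choose the opinion holder(s) of a sentence from pre-computed inputs.

    subject_candidates: None if no subject NP was found for the verbal
        ancestor of the most opinionated word; otherwise the (possibly
        empty) list of named entities or holder candidates inside it.
    quote_speaker: the speaker of a quotation in the sentence, if any.
    """
    # A quotation overrides the outcome of the other steps.
    if quote_speaker is not None:
        return [quote_speaker]
    # A subject was found and it contains usable candidates.
    if subject_candidates:
        return list(subject_candidates)
    # No subject, or a subject without candidates: default to the author.
    return ["author"]


print(select_opinion_holders(["the minister"]))          # ['the minister']
print(select_opinion_holders(None))                      # ['author']
print(select_opinion_holders([], quote_speaker="Smith")) # ['Smith']
```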
Kim and Hovy (2004) present a system that combines sentiment summarisation and
opinion mining: it finds people who expressed opinion on a given topic as well as orientation
of the opinion. The system operates in four steps. First it selects sentences that contain
both the topic phrase and holder candidates, found by means of BBN's named entity
tagger. Next, it delimits the holder-phrase region. Then the sentence sentiment classifier
calculates the polarity of all sentiment-bearing words individually. Finally, the system
on a corpus bootstrapped from a small manually-created corpus. Popescu and Etzioni
(2005) present a system and claim to be the first to report precision and recall on the tasks
of opinion phrase extraction and opinion phrase polarity determination in the context of
known product features and sentences. This system intensively uses the knowledge mining
tool, KnowItAll, a Web information-extraction system (Etzioni et al., 2005), to extract
product features and opinions regarding them.
Zhang and Varadarajan (2006) identify a new task in opinion extraction:
predicting the utility (or reliability, usefulness, informativeness) of product reviews. Utility is
defined as a multi-aspect feature of customer reviews that combines subjectivity with deep
technical analysis of a product's features. The authors build regression models by
incorporating a diverse set of features including lexical similarity, part of speech tags and lexical
subjectivity clues.
Titov and McDonald (2008) present a novel framework for extracting the features of
objects from online user reviews. They build statistical models to induce multi-grain
topics. The models not only extract features, but also cluster them into coherent topics, e.g.,
waitress and bartender are part of the same topic, staff, for restaurants. This differentiates
it from much of the previous work which extracts aspects through term frequency analysis
with minimal clustering.
Question Answering
Question answering (QA) is a well-established research topic in NLP. A new facet of it is
presented by opinion QA. Yu and Hatzivassiloglou (2003) study separating opinions from
fact, at both the document and sentence level, in the context of QA. Ku et al. (2007a)
define six opinion question types and use an information retrieval system to detect question
focus. The retrieved information is then processed to match the sentiment of the query.
2.2.2 Techniques
Research in sentiment analysis uses a number of techniques, such as supervised machine
learning, rule- and knowledge-based methods, and some others described below.
Supervised Machine Learning
Supervised machine learning is the most frequently used technique in sentiment
classification. To date, the majority of studies have used support vector machines (SVM) and Naïve
Bayes (NB). A study of the effectiveness of machine learning techniques was carried out
languages with scarce resources using on-line dictionaries.
Riloff and Wiebe (2003) describe a semi-supervised technique that learns extraction
patterns from a training corpus produced by high-precision classifiers and then applies the
newly found patterns to find more subjective sentences. The classifiers use a manually
created set of features (words and n-grams) to produce two sets of sentences: objective
and subjective. The two sets are then used by a pattern learner to find patterns that are
mostly used in subjective sentences. The process of learning is based on application of
a large set of syntactic templates to the corpus and extracting all possible patterns that
match the templates. The frequencies of the patterns obtained for each of the classes of
sentences (objective and subjective) are compared and the most subjectivity-associated
patterns are used to enlarge the feature set of the classifiers. In a later study, Wiebe and
Riloff (2005) extend the system by applying machine learning techniques to the extracted
sentences to increase recall.
Reference Data. A different approach to unsupervised sentiment classification is
described by Ghose et al. (2007). The authors use an economic context to find out what
makes a review positive or negative. The approach is based on the observation that
online merchants on eBay with positive feedback can sell products for higher prices than
competitors with negative evaluations. This makes it possible to use techniques from
econometrics to identify the `economic value of text' and assign a `dollar value' to each text
snippet, measuring sentiment strength and polarity effectively and without the need for
any annotated resources.
An alternative approach was explored by Read (2009). To find a document's sentiment
orientation Read compared the document with some prototypes (positive and negative
texts) using their constituents (words and phrases).
Linguistic Resources. Subasic and Huettner (2001) present an approach based on a
fusion of natural-language processing and fuzzy logic techniques for analysing affect content
in free text. The linguistic resource for the approach is a hand-crafted fuzzy affect lexicon,
from which other resources are generated: a fuzzy thesaurus and affect category groups. A
text is tagged with affect categories from the lexicon, and the affect categories' centralities
and intensities are combined using techniques from fuzzy logic to produce affect sets –
fuzzy sets that represent the affect quality of a document.
Zhuang et al. (2006) use WordNet, statistical analysis and movie knowledge for movie
review mining and summarisation.
convey negative sentiment, while adjective + noun is often used for expressing positive
sentiment. Wiebe et al. (2004) used collocations to identify fixed n-grams, for example:
worst-adj of-prep all-det. They also proposed a generalised version of collocations, where
certain classes of words are represented by a POS-tagged variable. For example, U-adj
as-prep represents a phrase that consists of a unique (occurring only once) adjective and the
preposition `as'. This generalised collocation matches phrases like `drastic as', `perverse
as' and `predatory as'.
Gamon (2004) analysed the effectiveness of linguistic features and found that part of
speech trigrams and an NP consisting of a pronoun followed by a punctuation character
were important for sentiment classification of customer reviews.
A broader context was used by Riloff et al. (2003). They created discourse features to
capture the density of sentiment indicators in the text surrounding a sentence. Pang and
Lee (2004) combined traditional bag-of-words features with inter-sentence level contextual
information in a minimum cut formulation.
Stylistic
Some studies have used stylistic attributes for sentiment analysis tasks. Wiebe et al.
(2004) used words that occurred only once (hapax legomena) to improve the accuracy of
subjectivity classification. They observed a significantly higher presence of unique words
in subjective texts compared to objective documents in a Wall Street Journal corpus and
noted that "Apparently, people are creative when they are being opinionated". Gamon
(2004) used the length of constituents (sentence, clauses, adverbial/adjectival phrases, and
noun phrases) for sentiment classification of feedback surveys. Abbasi et al. (2008) used
a wide array of English and Arabic stylistic attributes including lexical, structural, and
function word style markers and reported high accuracy in blog sentiment analysis.
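A stylistic feature based on hapax legomena is straightforward to compute once a text has been tokenised. The sketch below is illustrative only: the function name and the normalisation by the number of distinct word types are assumptions of this example and do not reproduce the exact measure used by Wiebe et al. (2004).

```python
from collections import Counter


def hapax_ratio(tokens):
    """Share of distinct word types that occur exactly once in `tokens`."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return hapaxes / len(counts)


tokens = "the plot was dull and the acting was dreadful".split()
# Types: the(2), plot, was(2), dull, and, acting, dreadful -> 5 of 7 occur once.
print(hapax_ratio(tokens))
```

Such a ratio could be supplied to a subjectivity classifier as one feature among others; higher values would then correspond to the more "creative" vocabulary that the authors associate with opinionated text.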
Feature Selection
Gamon (2004) describes a series of experiments for determining an optimal set of features
for the supervised sentiment polarity classification task. He tested three kinds of features:
linguistic features, surface features and word n-grams. The first kind was obtained by
means of a tool that provided a phrase structure tree and a logical form for each string.
The second kind consisted of word n-grams, function word frequencies and POS n-grams.
Gamon observed that the presence of very abstract linguistic analysis features improves
the performance of the classifiers and concluded that affect and style are linked in a more
the most detailed level of annotation.
2.4 Challenges of Sentiment Analysis
The ways in which opinions are expressed vary between languages and also within a
single language (so-called "domain-dependency"). For example, the word horrible, in a
description of a plot of a horror film, does not necessarily bear any sentiment-related
meaning. However, this word is a reliable indicator of negative sentiment in most other
domains (e.g. horrible performance). Turney (2002) observes that "for example, the
adjective 'unpredictable', may have a negative orientation in an automotive review, in a
phrase such as 'unpredictable steering' but it could have a positive orientation in a movie
review, in a phrase such as 'unpredictable plot'". This problem is further complicated
by ambiguity of word meaning in different contexts. This was studied by Wilson
et al. (2005), who give an example of the word trust:
(1) Philip Clapp, president of the National Environment Trust...
The word trust, which has positive prior polarity, in this context has neutral meaning since
it is part of a named entity.
Domain-dependency decreases the performance of classifiers trained on, or using data
from, a different domain (Engstrom, 2004). Read (2005) also noted a temporal
dependency, where even in the same domain people use different means of expressing sentiment
over time. A major current challenge is how to automatically extract sentiment
information from documents in different languages and in different domains. Most existing
approaches are based on adapting systems designed for one language (or domain) to another.
Obviously, there are differences between cultures, languages and even within a language
(consider the difference between evaluations of company financial prospects in a business
newspaper and reviews of a hard-rock festival in a participant's blog). Such differences
make adaptation difficult.
2.4.1 Cross-Domain Approaches
Aue and Gamon (2005) try to overcome the problem of domain-dependency of sentiment
analysis by means of using labelled data from other domains. They investigate and com-
pare four approaches:
1. training on a mixture of labelled data from other domains where such data are
available;
2. training a classifier as above, but limiting the set of features to those observed in
the target domain;
3. using ensembles of classifiers from domains where there is available labelled data;
4. combining small amounts of labelled data with large amounts of unlabelled data in
the target domain. This approach does not use any out-of-domain data; instead,
it uses a generative Naïve Bayes classifier using the Expectation Maximization
algorithm.
The four approaches were tested on four different corpora: movie reviews, book reviews,
product support services and knowledge base web survey data. It was found that the
approaches that used some data from the target domain (approaches 3 and 4) performed
better than ones that used only out-of-domain training data (1 and 2). The best accuracy
was achieved by the last approach, which still requires (small) amounts of annotated
in-domain data.
Blitzer et al. (2007) describe another way of overcoming domain-dependency by means
of the adaptation of a classifier trained in one domain to another. The authors raise the
problems of accuracy loss and domain similarity. The main idea underlying the approach
is Structural Correspondence Learning (SCL) developed by the authors in previous papers.
Since the authors use Mutual Information for finding new `pivot features' in unlabelled
domains, the full name of the approach is SCL-MI. The main intuition is that even when
key opinion words are completely distinct for each domain, if they have high correlation
with excellent and low correlation with awful in unlabelled data, then it is possible to
align them. The approach consists of three steps:
1. Using a labelled corpus from one domain and unlabelled corpora from both a new
domain and the old one, find pivot features which occur frequently in both domains.
2. SCL models the correlations between the pivot features and all other features by
training linear pivot predictors to predict occurrences of each pivot in the unlabelled
data from both domains (Ando and Zhang, 2005; Blitzer et al., 2006). This is based
on the calculation of correlation (MI) of pivot features (such as excellent) and
non-pivot features (like fast, dual-core).
3. For some domains the features found are not well-aligned (thus not good enough for
sentiment classification). To correct misalignment the authors manually label 50 top
national versions of the WordNet lexicon to identify subjective expressions.
Boiy and Moens (2008) performed a number of machine learning experiments in
sentiment analysis in Dutch, English and French. Although the experiments treated these
languages separately (no specific multi-lingual adaptation techniques were used), they
note language-specific particularities that affect sentiment analysis. The importance of
such language-specific features for multilingual processing is discussed by Bender (2009),
who argues that even approaches encoding little linguistic information can benefit from
language-specific specialisation.
Chapter 3
Features for Chinese Sentiment Classification¹
There are some distinctive characteristics of the Chinese language that are known to affect
language processing. This chapter presents an investigation of these in connection with
sentiment classification. Section 3.1 outlines problems with conceptualising Chinese text as
comprising a sequence of `words'. In particular, the problem of automatically segmenting
text into words is discussed and tested in an experiment. The difficulty of splitting Chinese
text into words raises the issue of what kind of basic unit of processing to use in sentiment
analysis. Section 3.2 describes kinds of units to be experimented on and the data for
the experiments as well as basic concepts, algorithms and evaluation metrics. Section
3.3 reports experiments in sentiment classification and discusses the results. Section 3.4
describes extensions to the techniques presented previously and discusses the results. All
the experimental results are summarised in section 3.5.
3.1 The `Word' in Chinese Language Processing
One of the central problems in Chinese NLP in general and in Chinese sentiment analysis
in particular is what the basic unit of processing should be. The problem is caused by
a distinctive feature of the Chinese language: the absence of orthographically marked
word boundaries, while it is widely assumed that a word is of extreme importance for
computational language processing. The absence of word delimiters cannot be solved
by simply using dictionary lookup (or any other method) to segment a text into words,
¹ The experiments and part of the discussion in this chapter were presented in a condensed form at the
Student Workshop at the 45th Meeting of the Association for Computational Linguistics and at the 2007
EUROLAN Doctoral Consortium (Zagibalov, 2007a,b).
                        Accuracy   Precision   Recall   F-Measure
NBm (Segmented)           83.59       0.84      0.84      0.84
NBm (Not segmented)       85.61       0.86      0.86      0.86
SVM (Segmented)           81.67       0.83      0.82      0.82
SVM (Not segmented)       85.50       0.86      0.86      0.86
Table 3.1: Results of sentiment classification of product reviews from the web-site IT168,
with and without segmentation
3.2 Words and Characters as Features for Sentiment Classification
In the absence of preliminary word segmentation, there are two possible types of feature
that could be used in Chinese sentiment classification: (vocabulary) words⁸ and characters.
This section reports experiments into these two types. The experiments evaluate various
techniques that can facilitate classification, including a simple negation check, as there is
no general agreement as to whether this feature is useful for sentiment classification. This
section also describes and tests an approach which divides the text into zones.
Processing based on words and characters is tested separately and in combination.
The latter approach is inspired by results published by Nie et al. (2000) who found that
for Chinese processing (IR in particular) the most effective kinds of features were a
combination of dictionary look up (using the longest-match algorithm) together with
single-character unigrams. Yuen et al. (2004) showed that Chinese characters constitute a
distinct sub-lexical unit which, though having a smaller number of distinct types, has greater
linguistic significance than words. Their experiments on sentiment classification of words
by means of characters proved to be effective, achieving a precision of 80.23% and a recall
of 85.03% with only 20 characters.
3.2.1 Basic Concepts
To introduce the approach I present some definitions of the concepts that are used in the
experiments.
⁸ The notion of word used is that of Vocabulary Word as defined by Li (2000), being the set of vocabulary
items listed in a dictionary.
Basic Units
A basic unit is the smallest linguistic unit used for processing. In this Chapter I experiment
with two kinds of basic units: words and characters.
Word. Noting the theoretical and practical difficulty of word segmentation in the
Chinese language, I use the notion of `vocabulary word', which is any sequence of
characters that forms a vocabulary item in the NTU sentiment dictionary. To avoid
confusion, I will also use the term `dictionary item' (DI) as a synonym of `vocabulary
word'.
Character. A character is any Chinese character (hieroglyph), excluding
punctuation marks and other symbols (stars, bullet points etc.).
Classification Units
A classification unit is a contiguous segment of a document and can be either of the basic
units or a larger unit, as indicated below.
Unigram. A unigram is a classification unit that consists of a single instance of a basic
unit.
Zone. A zone is a classification unit that includes one or more basic units and usually
is a sub-sentence unit. Zones are delimited by any non-character symbol (comma,
full-stop, semicolon, quotation marks etc). If a sentence does not have any delimiters
except for the final full-stop, the whole sentence is a zone. The idea of using zones
for classification comes from the observations that sentiment classification benefits
from consideration of word context, but that sentences may contain two or more
opposite sentiments. Thus I decided to include a unit that is usually longer than a
word but smaller than a sentence.
Sentence. A sentence is a sequence of basic units that ends with a full-stop, question
mark, exclamation mark or similar symbol that usually marks the end of a sentence.
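The zone and sentence units can be illustrated with a minimal sketch based on regular expressions. It rests on two assumptions that are not part of the definitions above: a `Chinese character' is approximated by the Unicode range U+4E00 to U+9FFF, so that every other symbol (punctuation, digits, Latin letters, whitespace) acts as a zone delimiter, and sentence boundaries are limited to a small hand-picked set of end marks. The function names are hypothetical.

```python
import re

# Assumption: Chinese characters are approximated by the CJK Unified
# Ideographs block; any run of other symbols delimits a zone.
_NON_CHARACTER = re.compile(r"[^\u4e00-\u9fff]+")
# Assumption: a small set of sentence-final marks (full-width and ASCII).
_SENTENCE_END = re.compile(r"[。！？!?]+")


def split_sentences(text):
    """Split text into sentences at sentence-final punctuation."""
    return [s for s in _SENTENCE_END.split(text) if s]


def split_zones(text):
    """Split text into zones, i.e. maximal runs of Chinese characters."""
    return [z for z in _NON_CHARACTER.split(text) if z]


review = "屏幕很好，但是电池不行。我很失望！"
print(split_sentences(review))  # ['屏幕很好，但是电池不行', '我很失望']
print(split_zones(review))      # ['屏幕很好', '但是电池不行', '我很失望']
```

In this example the first sentence yields two zones, one on each side of the comma, while the second sentence has no internal delimiter and is therefore a single zone.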
Frequency
The sentiment score (see below) is based on a basic unit's relative (normalised) frequency:
\[ F_a = \frac{N_a}{N} \tag{3.1} \]
where $N_a$ is the number of times $a$ occurred in a collection of documents and $N$ is the
total number of basic units (lexical units or characters, as appropriate) in the collection
of documents.
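As a small illustration of Equation 3.1, the relative frequency of every basic unit can be computed in one pass. The function name and the representation of a collection as a flat list of basic units are assumptions made for this sketch.

```python
from collections import Counter

def relative_frequencies(units):
    """Map each basic unit a to F_a = N_a / N for a flat list of units."""
    counts = Counter(units)
    total = sum(counts.values())
    # An empty collection yields an empty mapping, so no division by zero occurs.
    return {unit: n / total for unit, n in counts.items()}
```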
Sentiment Score
Each word (dictionary item) occurring in the positive side of the dictionary is assigned a
positive sentiment score of 1 and negative sentiment score 0, and vice versa for words in
the negative side.
Word Score The unsupervised approach does not presuppose obtaining any data from
the test corpus. So initially all the words had a score of 1 for the class (sentiment) in
which they are present and 0 for the class in which they are not.
Character Scores The characters for the experiments are extracted from the NTU
sentiment dictionary. Most of the characters occur in both sides of the dictionary:
positive and negative. The score for a character with respect to sentiment i (positive
or negative) is:
\[ S_{a_i} = \frac{F_i}{F_j} \tag{3.2} \]
where $F_i$ is the unit's frequency in a document collection of sentiment $i$ and $F_j$ is the
character's relative frequency in the opposite side of the dictionary.
The experiments also test modified sentiment scores: scores with a low or zero
frequency `penalty' and presence-based binary scores. Apart from the sentiment
score as described above, the experiments test four score modifications⁹:
1. All characters were assigned the basic scores based on the relative frequency
calculations, but if $S_{a_i} < 1$, then $S'_{a_i} = S_{a_i} - 1$. The intuition is that if a
character is less frequent in one side of the dictionary than in the other, then
it should be `penalised' by being assigned a negative score.
2. If $S_{a_i} > 0$, then $S'_{a_i} = 1$. This score is based on presence of a character in the
relevant side of the dictionary, regardless of its frequency.
3. If $S_{a_i} \geq 1$, then $S'_{a_i} = 1$, else $S'_{a_i} = 0$. This score is a binary version of the
basic score.
9 In the experiments the score modifications are represented by the numbers 1, 2, 3, 4.
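The basic character score and the three modifications listed above can be sketched as follows. Two points are assumptions of this sketch rather than statements from the experiments: the comparison in modification 3 is read as `greater than or equal to 1', and a zero frequency in the opposite side of the dictionary is mapped to an infinite score. The function names are hypothetical.

```python
def basic_score(f_same, f_opposite):
    """Equation 3.2: frequency for sentiment i divided by the opposite-side frequency."""
    if f_opposite == 0:
        # Assumption: an unseen opposite-side frequency gives an unbounded score.
        return float("inf") if f_same > 0 else 0.0
    return f_same / f_opposite

def modified_score(score, modification):
    """Apply score modification 1, 2 or 3 to a basic score."""
    if modification == 1:
        # Penalise characters that are rarer on this side than on the other.
        return score - 1 if score < 1 else score
    if modification == 2:
        # Presence-based: any positive score becomes 1.
        return 1.0 if score > 0 else score
    if modification == 3:
        # Binary version of the basic score (assumed threshold: score >= 1).
        return 1.0 if score >= 1 else 0.0
    raise ValueError("only modifications 1-3 are sketched here")
```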
Basic Unit Kinds Unigram Zone Sentence
Chars 0.68 0.69 0.69
Chars 1 0.66 0.68 0.67
Chars 2 0.52 0.52 0.52
Chars 3 0.68 0.72 0.70
Chars 4 0.70 0.71 0.71
Words 0.87 0.88 0.88
Words and Chars 0.72 0.72 0.72
Words and Chars 1 0.69 0.70 0.70
Words and Chars 2 0.57 0.58 0.58
Words and Chars 3 0.74 0.76 0.75
Words and Chars 4 0.73 0.73 0.73
Table 3.6: Precision of the unigram, zone-based and sentence-based sentiment classifiers
Words and Characters Words and characters when combined together performed
relatively well, showing the best features of both: accuracy was never too bad, and coverage
was fairly good. In unigram-based classification, three out of five combinations (with the
basic score and modifications 3 and 4) performed significantly better (at the 99% level) than
the other kinds of basic units, with the highest accuracy of 0.73 (see Table 3.2). The
combination of characters and words was able to classify many more documents than the
word-based classifier (at least 86% against 77%). It is also worth noting that all character-based
classifiers benefited from combination with words and performed better in all the
tests.
Classification Units
Another task of the experiments was to explore the influence of the classification unit
on classification performance. I compared the performance of the classifiers based on
unigrams, zones and sentences.
Unigrams The highest accuracy achieved with unigram-based classification was 0.73
(characters combined with words), the average accuracy was 0.66 (0.67 if the lowest and
the highest results are excluded).
Zones The introduction of zones decreased performance significantly: the highest accuracy
was achieved by the word-based classifier (0.68) and average accuracy was 0.61.
Sentences The results of sentence-based classification are very close to zone-based: the
average was 0.62 with the top result being 0.67.
The results obtained from the experiments indicate that the best classifier is one based
on the combination of words and characters. It is also possible to conclude that scoring
based on normalised frequency is better for Chinese sentiment classification than a binary
score. The presence-based binary score is not suitable for character-based classification,
but performs well with words. The results also suggest that for sentiment classification
a unigram-based approach is the best.
3.4 Sentiment Score Extensions
Although the preliminary experiments reported above produced some promising results,
the characteristics of sentiment, and language more generally, suggest some possible
extensions to the techniques which might lead to improved results. The extensions include
score calculation adjustments for negation, input data degree of skew and basic unit length.
This section presents the results of the experiments carried out using the same classifier
as above (see Algorithm 1 and Algorithm 2) with the only difference being in the score
calculation.
3.4.1 Negation Check
Negation plays an important role in language. It is also important in evaluative language,
as good and not good express different sentiments in most contexts. Most researchers agree
that including information about negation improves sentiment classification accuracy but
detecting and integrating this information may be a difficult task (see Section 2.2.2). In
this study the negation check is a very simple routine, based on regular expression patterns
that check whether a word or a character is preceded by a negation within the previous 2 characters.
If a negation is found the score is multiplied by -1:
\[ S'_a = S_a \times (-1) \tag{3.3} \]
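The negation check can be sketched with a single regular expression applied to the text that precedes a matched unit. The list of negation markers below is a hypothetical example (the actual patterns are not given here), and the function name and arguments are assumptions of this sketch.

```python
import re

# Hypothetical negation markers; the patterns used in the experiments may differ.
NEGATIONS = ("不", "没", "无")

# A negation that is the last or second-to-last character before the unit.
NEGATION_BEFORE = re.compile("(?:" + "|".join(map(re.escape, NEGATIONS)) + ").?$")

def apply_negation(zone, start, score):
    """Return -score if a negation occurs within 2 characters before index start."""
    prefix = zone[:start]
    return -score if NEGATION_BEFORE.search(prefix) else score
```

For instance, with the marker list above a positive unit that directly follows a negation, or follows it with one intervening character, has its score flipped, while a negation three characters earlier leaves the score unchanged.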
Basic Unit Kinds Accuracy (Overall) Accuracy (Positive) Accuracy (Negative) Precision Coverage
Chars 0.66 0.73 0.58 0.75 0.88
Chars 1 0.67 0.81 0.53 0.76 0.88
Chars 2 0.48 0.02 0.93 0.51 0.93
Chars 3 0.66 0.55 0.78 0.76 0.87
Chars 4 0.67 0.67 0.68 0.76 0.88
Words 0.72 0.71 0.72 0.90 0.79
Words and Chars 0.69 0.74 0.64 0.78 0.89
Words and Chars 1 0.69 0.81 0.57 0.78 0.89
Words and Chars 2 0.54 0.12 0.95 0.59 0.91
Words and Chars 3 0.71 0.60 0.81 0.80 0.88
Words and Chars 4 0.72 0.71 0.72 0.78 0.89
Table 3.8: Results of zone-based sentiment classification with negation
Zone-Based Classification
The zone-based classification results (see Table 3.8) show the same kind of improvement:
all of the classifiers improved their classification on the class on which they performed
worse in the previous experiments (see Table 3.4).
Sentence-Based Classification
Table 3.9 shows significant improvements in sentence-based classification compared to
classification without the negation check.
Overall, the experiments show that negation significantly improved the performance
of all the classifiers (except modification 2) by producing more balanced output. Another
notable difference introduced by the negation check is a significant improvement of the
word-based classifier using zones: in previous experiments this classifier did not show any
significant variation in performance between the various classification settings (see Tables
3.2, 3.4 and 3.5).
Basic Unit Kinds Accuracy (Overall) Accuracy (Positive) Accuracy (Negative) Precision Coverage
Chars 0.67 0.77 0.57 0.73 0.92
Chars 1 0.67 0.83 0.51 0.73 0.92
Chars 2 0.47 0.03 0.92 0.51 0.93
Chars 3 0.65 0.52 0.77 0.73 0.88
Chars 4 0.69 0.69 0.68 0.75 0.92
Words 0.69 0.69 0.69 0.89 0.78
Words and Chars 0.71 0.78 0.63 0.77 0.92
Words and Chars 1 0.70 0.83 0.56 0.75 0.92
Words and Chars 2 0.53 0.13 0.94 0.58 0.91
Words and Chars 3 0.70 0.59 0.81 0.78 0.90
Words and Chars 4 0.72 0.71 0.71 0.77 0.92
Table 3.9: Results of sentence-based sentiment classification with negation
3.4.2 Length Ratio
Unlike characters, words (dictionary items) have different lengths and can capture various
portions of context. For example, if a dictionary item covers most of a phrase a classifier
can more reliably detect the phrase's sentiment. In the sentence glossed as `It's really
neither fish nor fowl!' there are two matching dictionary items in the sentiment dictionary:
one meaning `really' and one meaning `neither fish nor fowl'. The first item
is in the positive side of the dictionary and the second is in the negative. If a classifier
compares their scores (1 for positive and -1 for negative), then it will not be able to make
any decision, but if it were to compare their lengths (2 and 4 characters) and combine this with their
scores (2 × 1 = 2 and 4 × -1 = -4), the whole sentence would be tagged negative.
A length-sensitive sentiment score can be defined as:
\[ \mathit{Score} = \frac{L_w^2}{L_{cu}} \tag{3.4} \]
where $L_w$ is the length of a word and $L_{cu}$ is the length of the relevant enclosing classification
unit. The numerator $L_w$ is squared to increase the importance of longer units.
Since all characters have length 1, there is no point in testing character-only classifiers
in conjunction with the length ratio.
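A worked instance of Equation 3.4 may help. The sketch below multiplies the length ratio by the polarity of the dictionary item (+1 or -1); that combination, the function name and the placeholder strings are assumptions made for illustration. A 2-character positive item and a 4-character negative item are placed in a hypothetical 6-character classification unit.

```python
def length_score(unit, classification_unit, polarity):
    """Length-sensitive score: polarity * L_w^2 / L_cu (polarity is +1 or -1)."""
    return polarity * len(unit) ** 2 / len(classification_unit)

# Placeholder strings stand in for the Chinese items: only their lengths matter.
zone = "abcdef"                                  # hypothetical unit, L_cu = 6
positive = length_score("ab", zone, +1)          # 2^2 / 6
negative = length_score("cdef", zone, -1)        # -(4^2) / 6
total = positive + negative                      # negative overall
```

The longer negative item dominates the sum, which matches the outcome described for the example sentence: the unit is tagged negative.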
Seeds on their own cannot produce a good classification due to their small number.
Section 4.3 describes a way to overcome this problem by applying an iterative approach.
This section also tests two techniques for increasing the precision of the iterative classifier:
filtering scores of found lexical units, to reduce the number of non-discriminative lexical
units, and using the difference between positive and negative zones to rank classification results
by their reliability. Further classification accuracy improvements are based on extending
the unsupervised classifier with supervised techniques: Naïve Bayes (multinomial) and
Support Vector Machine. The machine-learning extension is based on using classification
data produced by an unsupervised classifier to train supervised classifiers.
Section 4.5 summarises the experimental results described in this Chapter.
4.1 Dictionary Adjustment
A major disadvantage of a generic sentiment dictionary is that it does not take into
account domain-specific ways of expressing sentiments. Quite often the same word might
have opposite meanings in different contexts (e.g. `unpredictable plot' and `unpredictable
steering'). One possible solution is to assign domain-dependent sentiment scores to every
dictionary item. These scores would reflect how an item is connected with sentiment in a
particular domain. This section presents experiments on dictionary adjustment by means
of calculating domain-dependent sentiment scores. The scores can be obtained from a
pre-tagged corpus, but such an approach would no longer be unsupervised. To
keep the system unsupervised I used a classifier described in the previous Chapter (Section
3.2.2) to extract a sentiment-classified subcorpus from a raw corpus. The most important
feature of such a subcorpus is precision (providing the recall is high enough) rather than
accuracy. As the experiments described in the previous chapter show, the highest precision
was achieved by a word-based classifier with the negation check and using zones as the
unit of classification. This classifier was used as the basis for the experiments described
in this Chapter.
4.1.1 Adjustment to Corpus
I used the classifier to extract a subcorpus by labelling documents in the raw corpus according
to the classification results. The extracted subcorpus, consisting of 6447 documents
of which 3178 are classified as positive and 3269 are classified as negative, was used as a
training corpus in subsequent experiments. The corpus built using this data did not have
a very high accuracy (0.72), but it was balanced, having a similar number of positive and
Accuracy Precision Recall F-measure
Before adjustment 0.72 0.90 0.72 0.80
After adjustment 0.74 0.91 0.74 0.82
Table 4.2: Results of word-based sentiment classification before and after feature adjustment
Accuracy Precision Recall F-measure
Before adjustment 0.79 0.79 0.79 0.79
After adjustment 0.83 0.83 0.83 0.83
Table 4.3: Results of combined classifier sentiment classification before and after feature
adjustment
Accuracy Precision Recall F-measure
Before adjustment 0.72 0.90 0.72 0.80
After adjustment 0.74 0.91 0.74 0.81
Table 4.4: Average of the results of five runs on a test corpus of the word classifier
sentiment classification before and after feature adjustment
Corpus/product type Number of Reviews
Mobile phones 2317
Digital cameras 1705
MP3 players 779
Monitors 683
Office equipment (copiers, multifunction devices, scanners) 611
Printers (laser, inkjet) 569
Computer peripherals (mice, keyboards, speakers) 457
Video cameras and lenses 361
Networking (routers, network cards) 350
Computer parts (CD-drives, motherboards) 308
Table 4.5: Product types and sizes of the test corpora.
Table 4.4 shows that words with adjusted scores perform slightly better (the improvement
is statistically significant) than without.
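The F-measure columns in these tables are consistent with the harmonic mean of precision and recall. The sketch below assumes that definition (the balanced F-measure); the function name is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values reported before adjustment: P = 0.90, R = 0.72.
before = round(f_measure(0.90, 0.72), 2)
```

Under this assumption a precision of 0.90 and a recall of 0.72 give an F-measure of 0.80 after rounding. Small discrepancies in the second decimal place are possible where a table reports averages over several runs.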
4.1.2 Adjustment to Topic
The corpus used in the previous experiments consisted of customer reviews of consumer
electronics of different kinds. This provides an opportunity to split the corpus into
different topic-based subcorpora (topics for short) and test the ability of the approach to
find topic-dependent scores for the items in the sentiment dictionary. The experiments
presented below used the same corpus as described in Section 3.1.2, but in order to
extract domain-specific scores, the corpus was split into 10 topics (see Table 4.5).
Five of the corpora combine smaller ones of 100–250 reviews each (as indicated in
parentheses in Table 4.5) in order to have reasonable amounts of data in each. Each
corpus has equal numbers of positive and negative reviews so that it is possible to derive
strong comparator accuracy figures by applying supervised classifiers³ (studying the effect
of skewed class distributions is out of the scope of this study).
Table 4.6 compares the results of two classifications. The left side of the table presents
the results of classification using the sentiment dictionary without any topic-specific
adjustment. The right side contains results of classification using the same dictionary but
with scores calculated on the basis of the extracted subset of documents. Although all
3This corpus is publicly available at http://www.informatics.sussex.ac.uk/users/tz21/
Corpus No Scores Scores
P R F P R F
Mobile phones 0.87 0.71 0.78 0.87 0.72 0.79
Digital cameras 0.88 0.63 0.74 0.87 0.64 0.74
MP3 players 0.90 0.71 0.79 0.89 0.72 0.80
Monitors 0.87 0.71 0.78 0.87 0.74 0.80
Office equipment 0.90 0.72 0.80 0.87 0.74 0.80
Printers 0.90 0.71 0.79 0.88 0.71 0.79
Computer peripherals 0.93 0.79 0.85 0.91 0.81 0.86
Video 0.90 0.75 0.82 0.86 0.73 0.79
Networking 0.85 0.65 0.74 0.83 0.68 0.74
Computer parts 0.84 0.65 0.73 0.82 0.62 0.71
Macroaverage 0.88 0.70 0.78 0.87 0.71 0.78
Table 4.6: Classification results of different topics with the sentiment vocabulary with
(Scores) and without topic-adjusted scores (No Scores). P is precision, R is recall, F is
F-measure. The difference in the results for all corpora is statistically significant.
the results are significantly different (in terms of the paired t-test), there is only a slight
increase in recall at the expense of precision.
4.1.3 Discussion
Calculating domain-specific scores for lexical items improved performance across the corpus
but only marginally altered results of classification of the same corpus split into separate
topics. This may be due to the generic nature of the dictionary: it contains only
generic indicators of sentiment and is missing a lot of domain- and topic-specific ones.
Thus a larger corpus has a better chance to improve performance with this generic sentiment
dictionary as its items occur more frequently than in a small corpus. But if the
same collection is split into topical corpora where the role of domain-relevant words is
more important (the smaller the collection, the more important every lexical unit becomes)
then a generic dictionary fails to improve even after being adjusted with domain-related
scores. Another important feature of a sentiment corpus is its topical coherence. The more
closely related (in terms of the topic) documents are, the more important topic-related
words may be and the smaller the improvement one can expect with a generic sentiment
dictionary. This explains why the generic dictionary performed better on a more generic
corpus compared to the smaller more topic-oriented collections extracted from it.
4.2 Vocabulary Extraction
The experiments in the previous section suggest that a generic sentiment dictionary has
limited potential to improve performance even with domain-specific scores used for adjustment
of the dictionary item scores. If it is not possible to substantially increase performance
by adjusting an existing generic dictionary then the next possibility to explore
is creating domain-specific vocabularies.
4.2.1 Seed-Based Approach
Although the experiments described above suggest that classification results can potentially
be improved by adjusting the vocabulary to the domain, the inflexibility of the
precompiled vocabulary prevents it from full adjustment to a domain. Moreover, the
vocabulary-based approach prevents a system from being multilingual as the very need
for a comprehensive dictionary inevitably makes the system language-dependent. Another
problem of the dictionary-based approach is that it is virtually impossible to include all
important domain-related words. One way to solve the problem may be finding domain-related
lexical units in a subcorpus which was extracted by an unsupervised classifier
and calculating their sentiment scores for a given topic. This would pave the way to creating
a domain-specific vocabulary to be used for classification. But this technique requires
extraction of a subcorpus from a corpus to be classified so that words can be extracted
from it and scores calculated for them. Such a subcorpus is a product of classification
that needs some input data to start with. This input could be several lexical units (seeds)
used for initial classification and extraction of the subcorpus.
Seeds
The experiments below test a number of seeds, which were selected intuitively without
any special preliminary study of their potential effectiveness for the task of sentiment
classification. This approach is justified by the unsupervised paradigm of the research, as
any `learned' data would contradict it. Two types of seed word lists were investigated: six
one-word seed lists (see Table 4.7) and three multi-word seed lists consisting of the single
seeds in various combinations (see Table 4.8). All the seeds had their sentiment scores
set to 1 and the classifier was run with the seed lists taking the place of the sentiment
Corpus good allPOS all
P R F P R F P R F
Mobile phones 0.77 0.27 0.40 0.81 0.32 0.46 0.85 0.41 0.55
Digital cameras 0.76 0.19 0.30 0.80 0.24 0.37 0.86 0.35 0.50
MP3 players 0.77 0.21 0.33 0.83 0.28 0.42 0.88 0.35 0.50
Monitors 0.68 0.22 0.34 0.73 0.28 0.41 0.79 0.34 0.47
Office equipment 0.81 0.22 0.35 0.86 0.31 0.45 0.89 0.39 0.55
Printers 0.76 0.20 0.31 0.80 0.27 0.40 0.86 0.33 0.48
Computer peripherals 0.71 0.24 0.36 0.75 0.30 0.43 0.79 0.35 0.48
Video cameras and lenses 0.75 0.19 0.31 0.82 0.29 0.43 0.87 0.36 0.51
Networking 0.63 0.21 0.31 0.67 0.25 0.37 0.75 0.31 0.44
Computer parts 0.69 0.18 0.28 0.73 0.21 0.32 0.81 0.30 0.44
Macroaverage 0.73 0.21 0.33 0.78 0.28 0.41 0.84 0.35 0.49
Difference -0.02 -0.02 -0.02 -0.02 -0.01 -0.01 -0.01 -0.02 -0.02
Table 4.10: Classification results with the seed good, and seed lists allPOS and all. P
is precision, R is recall, F is F-measure. Difference shows the change in performance
compared with the corpus-wise classification (see Table 4.9). The differences in the results
for all seed lists are statistically significant.
Lexical Unit
As discussed in the previous chapter (Section 3.1.1), the concept of `word' is problematic
for segmentation in Chinese NLP, and so the term `seed word' is not very accurate since it is not possible to
guarantee that extracted units will always form words in the normally understood sense.
Fortunately, the results of the experiments with different kinds of features (Section 3.5.1)
showed that high accuracy can be achieved by a combination of both words and characters,
which makes it possible not to use words as basic units. Instead, I use lexical units
which could be any combination of characters constituting parts of words, words or even
phrases. This approach avoids the need for word segmentation, and can also capture some
grammatical and syntactic information, because lexical units can incorporate grammar
words and parts of grammatical constructions. Example (1) shows a combination of two
words that was extracted as one unit. This unit provides a context for each of its two
members and potentially is a better indicator of sentiment than either of them on their
own. The lexical unit in Example (2) consists of two function words, the first being a
grammar word with quite a complex meaning (mostly related to the sentence level) and
the second a modal verb. Separately these two words have no relation to sentiment but combined
together they are often used to show that something can be easily done or improved, which
relates to sentiment. Example (3) comprises a combination of a negated modal verb with
the first part of a number of words with the meaning "setting up; switching to" (e.g. words glossed as
`install, set up', `set to (some value)' and `switch to an automatic mode').
Thus the unit is capable of representing a whole set of similar phrases that describe the
inability of a device or a piece of software to perform a certain action, which most probably
expresses negative sentiment. This unit also has the advantage of being more frequent than
any of the full forms. To avoid confusion in what follows I will use the term `lexical unit'
(LU) rather than `word'. In the context of these experiments the term `seed' means an LU
used as a seed.
(1) appearance good
`the appearance is good'
(2) already can
`OK; has become possible'
(3) not able set . . .
`not able to set . . .'
Lexical Unit Extraction To find lexical units that are candidates for being seeds, the
process starts by looking for the longest character sequences that occur in any two zones
across all documents in the corpus (using the Longest Common Substring algorithm).
Although the process is computationally quite expensive it needs to be run only once⁵. The
application of this approach to the corpus produced more than 121 thousand lexical units.
The list was filtered to exclude non-character symbols (digits, Latin characters and hyphens, but
other in-word symbols were preserved). To reduce the list, all lexical units that occurred
fewer than 10 times in the corpus were excluded. The final version of the lexical item list
comprised 5492 items.
5 If efficiency were to be an issue, the corpus could be represented as a suffix tree to facilitate faster
extraction of lexical units that reoccur.
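The extraction step can be sketched as a pairwise longest-common-substring search over zones followed by a frequency filter. This is a quadratic illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, only one longest match per pair is kept, and occurrences are counted with non-overlapping substring matches.

```python
from collections import Counter
from itertools import combinations

def longest_common_substring(a, b):
    """Classic dynamic programme; returns one longest substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

def extract_lexical_units(zones, min_count=10):
    """Collect candidates from every pair of zones and keep the frequent ones."""
    candidates = {longest_common_substring(a, b) for a, b in combinations(zones, 2)}
    candidates.discard("")
    counts = Counter({u: sum(z.count(u) for z in zones) for u in candidates})
    return {unit: n for unit, n in counts.items() if n >= min_count}
```

Comparing every pair of zones is what makes the process expensive; the default `min_count=10` mirrors the threshold of 10 occurrences mentioned in the text.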
Corpus Seed Corpus Seed
Monitors } (good) Video
cameras
and
lenses
p (clear - of sound or image)
¿ (convenient; cheap) ¹¿ (comfortable)
p (clear) ( (practical)
ô (straight) ó (perfect)
¹¿ (comfortable) = (cool)
á (fill, fulfil)
) (sharp)
(comfortable)
= (cool)
Mobile
phones
} (good) Digital
cameras
} (good)
/ (support) ¿ (convenient; cheap)
¿ (convenient; cheap) ¹¿ (comfortable)
¹¿ (comfortable) p (clear - of sound or image)
p (clear - of sound or image) (special)
³ (sufficient) = (cool)
}( (easy to use) á (satisfied)
(comfortable) ( (durable)
º' (user friendly)
(comfortable)
AE (smooth and easy) ó (perfect)
Z (distinct) (straight)
= (cool) 3 (stable)
} (has become better) ¹¿ (has become comfortable)
( (durable) ¢ (polite)
¹¿ (comfortable) æÆ (detailed)
á (satisfied)
(fit, suit)
¹¿ (has become comfortable)
( (applicable)
zK (handy)
Ñf (science, scientific)
Networking 3 (stable) Printers } (good)
MP3
players
} (good) Computer
peripherals
} (good)
¿ (convenient; cheap) ¿ (convenient;cheap)
¹¿ (comfortable) ¹¿ (comfortable)
( (practical) Æ (precise)
uO (sensitive)
(comfortable)
(comfortable) `ï (habitual)
= (cool) AE (smooth and easy)
¹¿ (has become comfortable) 3 (stable)
Computer
parts
} (good) Office
equipment
} (good)
3 (stable) ¹¿ (comfortable)
3 (stable)
( (practical)
Table 4.11: Seeds automatically identified for each corpus.
Corpus Only Positive Pos & Neg all Seed List
P R F P R F P R F
Mobile phones 0.86 0.51 0.64 0.89 0.57 0.70 0.85 0.41 0.55
Digital cameras 0.82 0.35 0.49 0.88 0.45 0.60 0.86 0.35 0.50
MP3 players 0.83 0.34 0.48 0.87 0.41 0.55 0.88 0.35 0.50
Monitors 0.74 0.43 0.55 0.80 0.48 0.60 0.79 0.34 0.47
Office equipment 0.86 0.34 0.49 0.90 0.43 0.58 0.89 0.39 0.55
Printers 0.76 0.20 0.31 0.84 0.26 0.40 0.86 0.33 0.48
Computer peripherals 0.79 0.41 0.54 0.83 0.45 0.58 0.79 0.35 0.48
Video cameras and lenses 0.93 0.28 0.43 0.94 0.37 0.53 0.87 0.36 0.51
Networking 0.92 0.18 0.30 0.93 0.27 0.42 0.75 0.31 0.44
Computer parts 0.76 0.28 0.41 0.82 0.37 0.51 0.81 0.30 0.44
Macroaverage 0.83 0.33 0.46 0.87 0.41 0.55 0.84 0.35 0.49
Table 4.14: Classification results with only positive extracted seeds (Only Positive), the
same seeds augmented with generic negative seeds (Pos & Neg) and the all seed list (all Seed
List). P is precision, R is recall, F is F-measure. The differences between
the results are statistically significant for all corpora except where marked.
4.2.3 Iterative Approach
In the context of real-world applications, most of the results presented in the previous
experiments would probably be acceptable in terms of precision; however they are very
low in recall, especially compared to the vocabulary-based classifier described earlier. This
means that the seeds on their own are not sufficient and the classifier needs more lexical
units with appropriately calculated scores to perform better.
One way of extracting more lexical units from the corpus is to run the classifier iteratively.
Each new iteration uses a subset consisting of classified documents from the corpus
for extracting new lexical units and calculating their scores. The newly found set of lexical
units with scores assigned is then used for creating a new set of classified documents that
form a new subset for the next iteration (see Algorithm 5).
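The loop described above can be sketched as follows. This is not Algorithm 5, which is not reproduced here: the scoring rule (difference of class-relative frequencies), the stopping rule (labels no longer change, or a maximum number of iterations) and all function names are assumptions chosen to keep the example short.

```python
from collections import Counter

def classify(doc, scores):
    """Label a document by the sign of its summed lexical-unit scores."""
    total = sum(score * doc.count(unit) for unit, score in scores.items())
    if total > 0:
        return "pos"
    if total < 0:
        return "neg"
    return None  # the document stays unclassified

def rescore(labelled, units):
    """Hypothetical scoring: relative frequency in positive documents
    minus relative frequency in negative documents."""
    counts = {"pos": Counter(), "neg": Counter()}
    for doc, label in labelled:
        for unit in units:
            counts[label][unit] += doc.count(unit)
    totals = {label: sum(c.values()) or 1 for label, c in counts.items()}
    return {unit: counts["pos"][unit] / totals["pos"]
                  - counts["neg"][unit] / totals["neg"]
            for unit in units}

def iterate(docs, seeds, units, max_iterations=10):
    """Start from seed scores, then alternate classification and re-scoring."""
    scores, labels = dict(seeds), {}
    for _ in range(max_iterations):
        labelled = [(d, classify(d, scores)) for d in docs]
        labelled = [(d, label) for d, label in labelled if label is not None]
        if dict(labelled) == labels:
            break  # stand-in stopping rule: the labelling has stabilised
        labels = dict(labelled)
        scores = rescore(labelled, units)
    return labels, scores
```

In a toy run with the seeds good (+1) and bad (-1), documents that contain no seed remain unclassified in the first pass and are picked up in the second pass through units that co-occurred with the seeds, which is the recall gain the iterative approach aims for.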
Iteration Stopping Criterion
An iterative approach requires a way to control the number of iterations. I used a
goal-driven stopping criterion, which means that iterations should stop once the goal is achieved.
Mobile phones Monitors
Iter P R F C P R F C
1 0.86 0.41 0.56 1209 0.79 0.34 0.48 386
2 0.87 0.80 0.83 189 0.83 0.76 0.79 57
3 0.86 0.80 0.83 157 0.85 0.80 0.82 34
4 0.85 0.80 0.82 156 0.83 0.79 0.81 33
5 0.85 0.79 0.82 158 0.83 0.80 0.81 29
6 0.85 0.79 0.81 163 0.83 0.79 0.81 29
7 0.84 0.79 0.81 157 0.83 0.80 0.81 31
8 0.84 0.78 0.81 162 0.83 0.80 0.82 30
Table 4.15: Results of sentiment classification of 10 iterations with seed list all applied to
two topics Mobile phones and Monitors. Iter is the iteration number, P is precision,
R is recall, F is F-measure; C is the number of documents that were NOT classified.
Classification Results: Over the whole Corpus
The next set of experiments tests the performance of the same set of seeds as presented
in Section 4.2.1 on the whole corpus but using the iterative technique. After a number
of iterations the classifier produced good results with positive seeds (see Table 4.16) compared
to the non-iterative classifier (Table 4.9). The most significant progress was made
in overall accuracy of classification, but the results are also less skewed. The best results
were for the seed group all. All the other positive seeds also performed quite well regardless
of how many seeds there were in the list. In contrast, all negative seeds performed
poorly, barely improving over the naïve baseline. The reason for this is a very unbalanced
classification: almost all documents get tagged as positive, which results in near-baseline
performance. The skew towards positive classification (which is not expected from the
negative seeds) is the result of the skew towards negative classifications during the first
iteration: the extracted subcorpus contains many more negative documents compared to
positive ones, which affects extraction of lexical units and score calculation for them. The
system extracts too many negative lexical units with very low scores (because there are
too many documents classified as negative) and several high frequency supposedly positive
lexical units (with high scores as the number of positive documents is low). This leads
to a skew towards positive classification in subsequent iterations. This suggests that such
classifications should be avoided when the iteration control chooses the best iteration and
Seed list name P R F Acc AccP AccN Iterations
good 0.79 0.72 0.75 0.72 0.77 0.68 9
very good 0.77 0.71 0.74 0.71 0.74 0.68 12
comfortable 0.78 0.72 0.75 0.72 0.73 0.71 5
bad 0.53 0.50 0.52 0.50 0.94 0.06 2
too bad 0.51 0.49 0.50 0.49 0.98 0.01 2
poor 0.54 0.50 0.52 0.50 0.93 0.07 2
allPOS 0.79 0.72 0.75 0.72 0.77 0.68 10
allNEG 0.55 0.51 0.53 0.51 0.93 0.09 2
all 0.85 0.78 0.81 0.78 0.81 0.75 3
Table 4.16: Results of sentiment classification after iterations. P is precision, R is recall, F
is F-measure; Acc is accuracy, AccP is accuracy of the positive class and AccN is accuracy
of the negative class.
that the iteration control should be extended with a skew-control rule.
Skew Control The motivation behind skew control is to prevent a classifier from producing
highly skewed classifications. To do so, the skew control needs some approximate
`idea' of what a balanced classification is. Such a `gold standard' can be provided by the
first (seed-only) iteration:
\[ G = \frac{\min(C_i, C_j)}{\max(C_i, C_j)} \tag{4.1} \]
where $G$ is the `gold standard' for the balance, and $C_i$ and $C_j$ are the number of classified
documents of each class (either positive or negative). During the iterative classification
procedure, if the classification skew deviates from $G$ then the iterations are stopped.
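Equation 4.1 and the stopping test can be sketched in a few lines. The sketch assumes that only a drop in balance below a tolerated fraction of G counts as a deviation; the `window` parameter (default 0.5) corresponds to the relaxed comparison discussed next, and the function names are hypothetical.

```python
def balance(n_pos, n_neg):
    """Equation 4.1: size of the smaller class divided by the size of the larger one."""
    larger = max(n_pos, n_neg)
    return min(n_pos, n_neg) / larger if larger else 0.0

def is_skewed(n_pos, n_neg, gold, window=0.5):
    """Assumed rule: a classification is skewed if its balance is below gold * window."""
    return balance(n_pos, n_neg) < gold * window

# Seed-only iteration with 100 positive and 100 negative documents gives G = 1.
gold = balance(100, 100)
```

With G = 1 and a window of 0.5, a later classification of 100 and 60 documents is accepted, while one of 100 and 40 documents is treated as skewed and stops the iterations.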
This means that the skew control uses the balance of the initial classification to compare
with all subsequent classifications. However, if the system uses the exact value of the `gold
standard' (which is likely not to be perfect), then good classifications which are slightly
different in balance will be regarded as skewed and thus ignored. For this reason the system
in fact does not use a strict comparison but instead uses a `window' of 50%. For example,
if the initial iteration classified 100 positive documents and 100 negative documents, then
the `gold standard' would be 1; an acceptable balance should be at least 0.5 (a smaller
class can be half of the size of the bigger one). So if the next classification finds 100
Seed list name Top 10 words in positive list
good Í\ (control is (easy)), Zå¾ (carefully made),
If (x optics)
w ((it) has), (
(quality is rather good)
IfØ& (x optical zoom), Í\U (easy control), 5ñø (5 inch)
Hú (output), ý0Ì (rich in features)
very good Ð (supplied, provided), Í\ (control is (easy)), Zå¾ (carefully made)
If (optics),
IfØ& (x optical zoom), w ((it) has)
(
(quality is rather good), 5ñø(5 inch), Í\U (easy control), DVD+
comfortable
If (optics),
IfØ& (x optical zoom), Hú (output)
Ð (supplied, provided), 8úr ([extr]emely outstanding)
^8ú (extremely out[standing]), dpi, (feel comfortable)
Húr (outstanding output), Z徯 (carefully made)
bad CRT, Ù>ó± (these speakers), 8- (during the game)
>:h ((of) monitor), >Ï¡ (CRT)
Nó® (subwoofer), U1 ((some) distortion), (8 (in the game)
àU1(geometric distortion), k± (satellite speakers)
too bad Ç( (used), w ((it) has), riØ (colour reduction), ¾¡ (visual design)
¾¡ (design), Ð (supplied, provided), Ç( ((it) uses)
IfØ& (optical zoom), ý: (rich in features), Í\ (control is (easy))
poor Ç( (used), w ((it) has), ¾¡ (visual design), ¾¡ (design)
Ð (supplied, provided), IfØ& (optical zoom), ý: (rich in features)
Í\ (control is (easy)), Ç(((it) uses), ý:' (rich in features)
allPOS Zå¾ (carefully made),
If (x optics),
IfØ& (x optical zoom)
(
(quality is rather good), 5ñø (5 inch), Í\U (easy control)
Hú (output), ó(
(good sound quality)
ýPh (full of features), w ((it) has)
allNEG
If (x optics),
IfØ& (x optical zoom)
ýPh (full of features), (feel comfortable)
Hú (output), ^8ú (extremely out[standing]), (control (of)), dpi
8úr ([extr]emely outstanding), Húr (outstanding output)
all Zå¾ (carefully made),
If (x optics),
IfØ& (x optical zoom)
Ð (supplied, provided), (
(quality is rather good), 5ñø (5 inch)
Í\U (easy control), ýPh (full of features), w ((it) has), Hú (output)
Table 4.17: Top 10 positive lexical units found on completion of iterations.
Corpus                      allPOS                Extracted
                            P     R     F         P     R     F
Mobile phones               0.82  0.76  0.79      0.86  0.80  0.83
Digital cameras             0.74  0.66  0.70      0.74  0.67  0.70
MP3 players                 0.76  0.71  0.74      0.75  0.70  0.72
Monitors                    0.81  0.77  0.79      0.81  0.78  0.79
Office equipment            0.79  0.71  0.75      0.80  0.73  0.76
Printers                    0.80  0.73  0.76      0.75  0.68  0.72
Computer peripherals        0.61  0.56  0.58      0.61  0.57  0.59
Video cameras and lenses    0.67  0.63  0.65      0.50  0.47  0.48
Networking                  0.68  0.25  0.37      0.81  0.72  0.76
Computer parts              0.55  0.51  0.53      0.50  0.46  0.48
Macroaverage                0.72  0.63  0.67      0.71  0.66  0.68

Table 4.19: Classification results with the allPOS seed list and with only positive extracted
seeds (Extracted). P is precision, R is recall, F is F-measure. Differences between the two sets of
results are statistically significant except for the corpora marked with .
formed better in terms of recall but precision was almost the same as that of the generic
seeds (see Table 4.19). In two topics (Computer parts and Video) the extracted seeds failed
to perform better than the naive baseline, and the generic seeds failed to do so in the topics
Networking and Computer parts. The result of classification of the topic Networking illustrates
the importance of a seed's domain-relevance: only one extracted seed outperformed
three generic ones. However, in the topics Video and Computer parts the generic seeds performed
better. The performance of the extracted seeds was most probably compromised
by the small size of these two topic corpora (only 361 and 308 documents respectively, see
Table 4.5) and by the fact that the collections combined reviews of related but nevertheless
different items (video cameras and lenses; CD-drives and motherboards). But on a big topic such as
Mobile phones the extracted seeds performed much better, mostly due to the large number
of extracted seeds (21 lexical units, see Table 4.11).
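The P, R and F columns of Table 4.19 can be related to each other with a short sketch. It assumes that F is the balanced F-measure (the harmonic mean of precision and recall) and that the macroaverage is the unweighted mean over the ten corpora; both assumptions are consistent with the printed values but are not stated in this section. The function names are illustrative only.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macroaverage(values):
    """Unweighted mean over corpora, so every topic counts equally (assumed definition)."""
    values = list(values)
    return sum(values) / len(values)


# (P, R) pairs for the allPOS seed list, copied from Table 4.19.
allpos = {
    "Mobile phones": (0.82, 0.76),
    "Digital cameras": (0.74, 0.66),
    "MP3 players": (0.76, 0.71),
    "Monitors": (0.81, 0.77),
    "Office equipment": (0.79, 0.71),
    "Printers": (0.80, 0.73),
    "Computer peripherals": (0.61, 0.56),
    "Video cameras and lenses": (0.67, 0.63),
    "Networking": (0.68, 0.25),
    "Computer parts": (0.55, 0.51),
}

# Per-corpus F recomputed from the rounded P and R values.
for corpus, (p, r) in allpos.items():
    print(f"{corpus:26s} F = {f_measure(p, r):.2f}")

macro_p = macroaverage(p for p, _ in allpos.values())
macro_r = macroaverage(r for _, r in allpos.values())
print(f"Macroaverage P = {macro_p:.2f}, R = {macro_r:.2f}")
```

Because the table reports values rounded to two decimals, figures recomputed from them can differ from the printed ones in the last digit. The Networking row shows why the harmonic mean matters here: a precision of 0.68 with a recall of 0.25 gives an F-measure of only 0.37.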
Another comparable pair of seed lists is the all seed list and the extracted seeds
combined with generic negative seeds (the same as the ones in all). Negative seeds helped
both of the seed lists to increase performance, but the generic seeds gained more compared
to the extracted ones (see Table 4.20). Although slightly better in recall, the generic seeds
Corpus                      all                   ExtractedNeg
                            P     R     F         P     R     F
Mobile phones               0.85  0.80  0.82      0.89  0.83  0.86
Digital cameras             0.82  0.74  0.77      0.81  0.73  0.77
MP3 players                 0.81  0.75  0.78      0.79  0.73  0.76
Monitors                    0.83  0.80  0.81      0.83  0.80  0.81
Office equipment            0.81  0.75  0.78      0.83  0.76  0.80
Printers                    0.82  0.75  0.78      0.82  0.75  0.78
Computer peripherals        0.82  0.78  0.80      0.84  0.79  0.81
Video cameras and lenses    0.77  0.73  0.75      0.70  0.66  0.68
Networking                  0.75  0.31  0.44      0.83  0.72  0.77
Computer parts              0.67  0.63  0.65      0.67  0.63  0.65
Macroaverage                0.80  0.70  0.74      0.80  0.74  0.77

Table 4.20: Classification results with generic seeds (all) and extracted seeds combined
with generic negative seeds (ExtractedNeg). P is precision, R is recall, F is F-measure.
are similar in terms of precision. Again, similarly to the previous experiments, on a large
document collection (Mobile phones) the extracted seeds performed much better than the
generic ones. Both classifiers performed well (much higher than the naive baseline) on all
of the topics, which confirms the importance of negative seeds.
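Why negative seeds matter in an iterative, seed-based classifier can be illustrated with a toy sketch. This is not the procedure used in the experiments: the scoring rule (seed hits counted per document), the vocabulary-extraction criterion (`min_count`, `min_ratio`), the stopping rule and the English toy corpus are all assumptions made for illustration, and every function name is hypothetical.

```python
from collections import Counter


def classify(docs, pos_words, neg_words):
    """Label each tokenised document by comparing seed hits; ties stay unlabelled (None)."""
    labels = []
    for tokens in docs:
        score = sum(t in pos_words for t in tokens) - sum(t in neg_words for t in tokens)
        labels.append("pos" if score > 0 else "neg" if score < 0 else None)
    return labels


def extract_vocabulary(docs, labels, min_count=2, min_ratio=2.0):
    """Keep words that occur in at least min_count documents of one class and
    at least min_ratio times as often as in the other class (assumed criterion)."""
    pos_df, neg_df = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        if label == "pos":
            pos_df.update(set(tokens))
        elif label == "neg":
            neg_df.update(set(tokens))
    pos_words = {w for w, c in pos_df.items()
                 if c >= min_count and c >= min_ratio * neg_df[w]}
    neg_words = {w for w, c in neg_df.items()
                 if c >= min_count and c >= min_ratio * pos_df[w]}
    return pos_words, neg_words


def bootstrap(docs, pos_seeds, neg_seeds, max_iter=10):
    """Alternate classification and vocabulary extraction until the labels stop changing."""
    pos_words, neg_words = set(pos_seeds), set(neg_seeds)
    labels = None
    for _ in range(max_iter):
        new_labels = classify(docs, pos_words, neg_words)
        if new_labels == labels:
            break
        labels = new_labels
        pos_words, neg_words = extract_vocabulary(docs, labels)
        # The original seeds are always kept in the vocabulary.
        pos_words |= set(pos_seeds)
        neg_words |= set(neg_seeds)
    return labels, pos_words, neg_words


# Toy corpus: documents 2 and 5 contain no seed word at all.
docs = [
    ["good", "screen", "sharp"],
    ["sharp", "picture", "good"],
    ["sharp", "sound", "clear"],
    ["bad", "battery", "noisy"],
    ["noisy", "fan", "bad"],
    ["noisy", "keyboard", "flimsy"],
]

with_neg = bootstrap(docs, {"good"}, {"bad"})
pos_only = bootstrap(docs, {"good"}, set())
print(with_neg)
print(pos_only)
```

On this toy corpus the run with both seeds picks up "sharp" and "noisy" in the first iteration and then labels all six documents, whereas the positive-only run never acquires a negative word and leaves the last three documents unlabelled. The real system may treat this case differently; the sketch only shows why a missing negative vocabulary limits what iteration alone can recover.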
4.2.4 Discussion
The experiments presented above showed that although features (vocabulary) adjusted
to the domain produce better sentiment classification, a vocabulary-based approach has
limited ability to adapt to a domain: it is not possible to foresee all possible sentiment-bearing
lexical units in all possible domains. An alternative approach, based on using
seeds for classification, proved to be effective when used with multiple iterations. The all
seeds, consisting of both positive and negative lexical units, managed to bootstrap a better
vocabulary from the corpus than the extracted ones. The biggest disadvantage of the latter
is the absence of negative lexical units. However, augmented with generic negative seeds, the
extracted seeds performed quite well in terms of recall, especially on large document
collections. Generally, iterations allow the bootstrapping of a domain-related sentiment
vocabulary which in some cases performs better than the generic sentiment vocabulary