Adaptive Subjective Triggers for Opinionated Document Retrieval
Building (2009)
- ISBN: 9781605583907
- DOI: 10.1145/1498759.1498805
Available from portal.acm.org
Abstract
- A statistical language model is applied to opinionated document retrieval.
- The trigger model is used, with the assumption that there are two constituents to form a subjective opinion:
  - The object of the opinion
  - The subjective expression
Opinionated Document Retrieval Using Subjective
Triggers
Kazuhiro Seki
Organization of Advanced Science and Technology, Kobe University, 1-1 Rokkodai, Nada,
Kobe 657-8501, Japan. E-mail: seki@cs.kobe-u.ac.jp
Kuniaki Uehara
Graduate School of System Informatics, Kobe University, 1-1 Rokkodai, Nada, Kobe 657-8501, Japan.
E-mail: uehara@kobe-u.ac.jp
This article proposes a novel application of a statisti-
cal language model to opinionated document retrieval
targeting weblogs (blogs). In particular, we explore the
use of the trigger model—originally developed for incor-
porating distant word dependencies—in order to model
the characteristics of personal opinions that cannot be
properly modeled by standard n-grams. Our primary
assumption is that there are two constituents to form a
subjective opinion. One is the subject of the opinion or
the object that the opinion is about, and the other is a sub-
jective expression; the former is regarded as a triggering
word and the latter as a triggered word. We automati-
cally identify those subjective trigger patterns to build
a language model from a corpus of product customer
reviews. Experimental results on the Text Retrieval Con-
ference Blog track test collections show that, when used
for reranking initial search results, our proposed model
significantly improves opinionated document retrieval. In
addition, we report on an experiment on dynamic adaptation
of the model to a given query, which is found effective
for most of the difficult queries categorized under politics
and organizations. We also demonstrate that, without
any modification to the proposed model itself, it can be
effectively applied to polarized opinion retrieval.
Introduction
Since the advent of the Web, many forms of user-generated
content (UGC) have evolved, including personal homepages,
discussion boards, and weblogs (blogs). Such UGC
typically contains subjective opinions of individual authors,
which are difficult to find in the conventional mass media,
such as magazines and newspapers. Among them, blogs have
gained popularity as a means to express personal opinions
regarding politics, hobbies, people, etc., due to their ease of
use and maintenance. Because of their wide acceptance among
the general public, blogs have been drawing much attention
from natural language processing (NLP), information
retrieval (IR), and other research communities as an attractive
domain for exploration (Adar & Adamic, 2005; Agarwal, Liu,
Tang, & Yu, 2008; Ding, Liu, & Yu, 2008; Esuli & Sebastiani,
2007; Mei, Ling, Wondra, Su, & Zhai, 2007).
Received January 12, 2010; revised January 06, 2011; accepted January 6,
2011
© 2011 ASIS&T • Published online 24 February 2011 in Wiley Online
Library (wileyonlinelibrary.com). DOI: 10.1002/asi.21502
Among a variety of research opportunities targeting blogs,
this article focuses on opinionated document (blog post)
retrieval, a task to retrieve blog posts not only relevant to a
user query but also containing subjective opinions of authors.
Opinionated document retrieval has been tackled by many
researchers, partly motivated by the Text Retrieval Conference
(TREC) Blog track (Macdonald, Ounis, & Soboroff,
2007; Ounis, de Rijke, Macdonald, Mishne, & Soboroff,
2006; Ounis, Macdonald, & Soboroff, 2008). Previous works
by the track participants and others can be roughly catego-
rized into lexicon-based (Lee et al., 2008; Mishne, 2006;
Oard et al., 2006; Vechtomova, 2010; Zhang & Ye, 2008)
and classification-based (Gerani, Carman, & Crestani, 2009;
Zhang & Yu, 2006; Zhang, Yu, & Meng, 2007). Briefly,
the former uses a manually or automatically compiled list
of words, such as “like” and “fantastic,” and in essence
assumes the existence of those words in a document as an
indicator of opinions. The latter, classification-based, also
typically relies on word occurrences but automatically creates
a classifier based on positive (i.e., opinionated) and nega-
tive (i.e., nonopinionated) examples using machine-learning
algorithms.
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 62(5):861–876, 2011
In this article, we propose a simple but effective
approach to opinionated document retrieval (or opinion
retrieval for short), which does not belong to either category.
Our approach was in part inspired by the empirical finding
that considering the proximity of pronouns and subjective
expressions to objects improves opinion retrieval (Zhou,
Joshi, & Bayrak, 2007). We take advantage of statistical lan-
guage models for capturing such characteristic patterns often
seen in opinionated documents. In particular, we explore the
use of the classic trigger model (Lau, Rosenfeld, & Roukos,
1993; Tillmann & Ney, 1996), which was originally proposed
for dealing with long-distance word dependencies. Our pri-
mary assumption is that there are two essential constituents
to form a personal or subjective opinion. One is the sub-
ject of the opinion (e.g., “I”) or the object that the opinion is
about (e.g., “this movie”), and the other is a subjective expression
(e.g., “like”). We regard the former as a triggering word
and the latter as a triggered word and automatically identify
trigger patterns characteristic of subjective opinions using
customer reviews collected from Amazon.com. Through several
experiments on the TREC Blog track test collections, it
is demonstrated that, when used for reranking, our proposed
model significantly improves the IR system performance and
that dynamically adapting the model to a given query gives
steady improvement. Also, it is shown that our approach can
be easily extended to polarized document retrieval, which
distinguishes positive and negative opinions.
In the remainder of this article, we first detail our approach
to building a trigger model for subjective opinions. Then, we
evaluate the validity of our proposed model and its effective-
ness in retrieving opinionated blog posts by way of a variety
of experiments on the Blog track test collections. After that,
we summarize the related work. Finally, we conclude this
paper with a brief summary of the findings and possible future
directions.
Opinion Retrieval by a Trigger Model
Motivation
To judge whether a given document contains subjective
opinions, the simplest and most intuitive approach would be
to look for subjective words in the document. The underlying
assumption of this kind of lexicon-based approach is that
if a document contains words often used for expressing sub-
jectivity, it is likely to be opinionated. For instance, “like”
may be a good indicator for favorable feelings. Along this
line, many researchers manually or automatically created a
sentiment-oriented word list or dictionary to use for identi-
fying opinions (for example, Lee et al., 2008; Vechtomova,
2010; Yang, Yu, Valerio, & Zhang, 2006). Although reported
effective, a potential limitation of this approach is that, contrary
to intuition, a document containing such subjective words is
not necessarily opinionated. For example, “It looks like a cat.”
and “She likes singing.” could conceivably express a subjective opinion
of the writer but are more naturally read as objective; the former uses “like” as
a preposition, and the latter is a statement of fact about a
third person. To distinguish such cases, one would need
to look at the wider context in which those potentially subjective
words occur.
One way to consider wider context is to use the classic
n-gram language models (Manning & Schütze, 1999), which
estimate the probability of a word occurrence based on the
prior local context. Basically, they treat n consecutive terms as
a unit of analysis. For example, the bigrams in the above sentence
“It looks like a cat.” are “It looks,” “looks like,” “like a,” and
“a cat,” where “like” is now analyzed with the local context
(i.e., “looks” and “a”), rather than the individual occurrence.
Although one could take into account as wide a context as
desired, simply increasing n will cause data sparseness and
result in unreliable parameter estimation. For such reasons,
n is often set to 2 or 3 depending on the intended application
and the amount of available corpora.
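The bigram decomposition described above can be sketched in a few lines of Python. This is only an illustration: the helper name `ngrams` and the whitespace tokenization are assumptions made for the example, not part of the original method.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Whitespace tokenization is a simplifying assumption for this sketch.
tokens = "It looks like a cat".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)   # each word paired with its immediate predecessor
print(trigrams)  # a wider context, but fewer and rarer units
```

For a five-word sentence there are four bigrams but only three trigrams, and each distinct trigram is observed less often in a corpus than the bigrams it contains, which is the sparseness problem mentioned above.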
In this work, we aim to improve opinionated document
retrieval and study the use of trigger models for captur-
ing patterns or word dependencies that are characteristic to
subjective opinions.
Subjective Trigger Models
Despite their simplicity, n-gram language models have been
successfully applied to many NLP-related problems. How-
ever, it is clear that there exist long-distance dependencies
beyond the limited horizon specified by n. To include such
dependencies, Lau et al. (1993) proposed the trigger-based
language model (or trigger model for short). A trigger refers
to a word that tends to bring about the occurrence of
another word. For example, “neither” and “nor” are often used as a
pair in the same sentence, such as “I am neither a liberal nor
a conservative.” (We will use this example in the following to
illustrate the trigger model.) An n-gram model is not suitable
for capturing this kind of dependency because the words
between “neither” and “nor” can be any phrase of any
length. A trigger model P_T(w|h) can incorporate such trigger
pairs and is used to enhance a baseline n-gram language
model P_B(w|h) by linearly interpolating the two:
P_E(w|h) = (1 − λ) · P_B(w|h) + λ · P_T(w|h)    (1)
where w and h denote a word and a history, respectively, and λ
is the interpolation parameter. For the above example of “nor,”
the history h of the baseline n-gram model P_B is its preceding
words (e.g., “a liberal” for n = 3), whereas the history of the trigger model
P_T may be “neither.” We will briefly describe the
definition of P_T(w|h) later.
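As a minimal numeric sketch of the linear interpolation in Equation 1, the following Python snippet combines a baseline probability and a trigger probability. The function name `interpolate` and all probability values are hypothetical, chosen only to show the arithmetic; they are not taken from the article.

```python
def interpolate(p_baseline, p_trigger, lam):
    """Equation 1: (1 - lam) * P_B(w|h) + lam * P_T(w|h)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * p_baseline + lam * p_trigger


# Hypothetical values for w = "nor":
p_b = 0.001  # baseline trigram estimate P_B(nor | a liberal), assumed
p_t = 0.2    # trigger estimate P_T(nor | neither), assumed
lam = 0.1    # interpolation parameter, assumed

p_e = interpolate(p_b, p_t, lam)
print(p_e)  # 0.9 * 0.001 + 0.1 * 0.2
```

With these made-up numbers, even a small weight on the trigger model raises the probability of “nor” well above the baseline estimate, which is the effect the interpolation is meant to achieve when a triggering word has been seen earlier in the history.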
To build a trigger model, we first need to identify signifi-
cant triggering and triggered word pairs (e.g., “neither” and
“nor”). Given a corpus of documents, any word pair, such
as “I → nor” and “am → nor,” in the vocabulary can poten-
tially be a trigger pair. Here, vocabulary is a list of words that
appear in a given corpus. Tillmann and Ney (1996) proposed
a criterion to consider word w as a potential triggered word
only when an n-gram model P(w|h) without smoothing (different
from P_B(w|h)) gives a “poor” estimate for w, meaning
that P(w|h) is smaller than a predefined threshold t. That is,
P(w|h) < t (2)
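The criterion in Equation 2 can be sketched as follows, assuming a plain maximum-likelihood (unsmoothed) n-gram estimate. The toy corpus, the threshold value, and the helper names `mle_ngram_prob` and `is_candidate_triggered_word` are all hypothetical and serve only to illustrate the test.

```python
from collections import Counter


def mle_ngram_prob(tokens, history, word):
    """Unsmoothed maximum-likelihood estimate of P(word | history)."""
    n = len(history) + 1
    positions = range(len(tokens) - n + 1)
    grams = Counter(tuple(tokens[i:i + n]) for i in positions)
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in positions)
    h = tuple(history)
    if contexts[h] == 0:
        return 0.0  # unseen history: no evidence for the word at all
    return grams[h + (word,)] / contexts[h]


def is_candidate_triggered_word(tokens, history, word, t):
    """Equation 2: the word is a candidate when P(word | history) < t."""
    return mle_ngram_prob(tokens, history, word) < t


# Toy corpus (assumed); "a liberal" occurs four times as a history.
corpus = (
    "i am neither a liberal nor a conservative . "
    "she is a liberal and he is a liberal too . "
    "a liberal and a conservative met ."
).split()

t = 0.3  # hypothetical threshold
p_nor = mle_ngram_prob(corpus, ("a", "liberal"), "nor")
p_and = mle_ngram_prob(corpus, ("a", "liberal"), "and")
print(p_nor, is_candidate_triggered_word(corpus, ("a", "liberal"), "nor", t))
print(p_and, is_candidate_triggered_word(corpus, ("a", "liberal"), "and", t))
```

In this toy corpus “nor” follows “a liberal” once out of four occurrences and “and” twice, so with the assumed threshold only “nor” is flagged as poorly predicted by its local context.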
For example, if P(nor|a liberal) is smaller than t, “nor” is
considered as a potential triggered word b. In other words, the
exact word sequence “a liberal nor” rarely appears in a given


