Analysis of Document Diversity through Sentence-Level Opinion and Relation Extraction
Abstract
Diversity in document retrieval has been mainly approached as a classical statistical problem, where the typical optimization function aims at diversifying the retrieved items represented by means of language models. Although this is an essential step for the development of effective approaches to capture diversity, it is clearly not sufficient. The effort in Novelty Detection has shown that sentence-level analysis is a promising research direction. However, models and theory are needed for understanding the difference in content of the target sentences. This paper argues for using the current state of the art in Relation and Opinion Extraction at the sentence level. After presenting some ideas for the use of this technology for document retrieval, advanced extraction models are briefly described.
Alessandro Moschitti
Department of Computer Science and Information Engineering
University of Trento
Via Sommarive 14, 38100 POVO (TN) - Italy
moschitti@disi.unitn.it
Keywords: Relation Extraction; Opinion Mining; Diversity in Retrieval
1 Introduction
Diversity in document retrieval has been mainly approached as a classical
statistical problem, where the typical optimization function aims at
diversifying the retrieved items represented by means of language models; see
for example the novelty detection track [2]. Although this is an essential step
for the development of effective approaches to diversity in retrieval, it is
not sufficient. Indeed, while for standard document retrieval frequency counts
and the related weighting schemes help in defining the most probable user
information needs, they play an adverse role in capturing diversity.
For example, when retrieving documents related to the entity Michael Jordan, a
huge amount of text will be related to the basketball player; perhaps other
items will be related to the Michael Jordan who is a statistician and
professor, but very few of them will be devoted to, e.g., the Michael Jordan
who is an accounting employee for Rolfe, Benson LLP. The occurrences of the
latter in Web documents will be so few that no language model, however
powerful, will be able to exploit them effectively, given the ocean of
information about the basketball player. In other words, there will not be
enough statistical evidence to build a language model for such an employee;
consequently, the related context, e.g. words, can be confused with that of
other documents unrelated to Michael Jordan.
The solution to this problem requires techniques for fine-grained analysis of
document semantics. In a statistical framework this means that we need to
extract features semantically related¹ to the object about which the users
expressed their information needs. Such features cannot consist merely of
simple context words, as the frequency problem highlighted above would prevent
them from being effective. In contrast, textual relations between entities like
those defined in ACE [8] provide an interesting level of characterization of
the target entity. For example, the single relation Is employed at can easily
distinguish the three Michael Jordans above. A search engine aiming at
providing diversity in retrieval will need to integrate such technology into
the classical language model.
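
The role of the Is employed at relation can be illustrated with a minimal
Python sketch. It assumes that relations have already been extracted upstream
as (entity, relation, argument) triples; the document identifiers, the
is_employed_at label, the helper names and the employers other than Rolfe,
Benson LLP are illustrative assumptions, not part of the approach described in
this paper.

```python
from collections import defaultdict

# Hypothetical retrieved documents. The triples stand in for the output of a
# relation extractor; ids and most employers are made up for illustration.
docs = [
    {"id": "d1", "relations": [("Michael Jordan", "is_employed_at", "Chicago Bulls")]},
    {"id": "d2", "relations": [("Michael Jordan", "is_employed_at", "Chicago Bulls")]},
    {"id": "d3", "relations": [("Michael Jordan", "is_employed_at", "UC Berkeley")]},
    {"id": "d4", "relations": [("Michael Jordan", "is_employed_at", "Rolfe, Benson LLP")]},
]


def group_by_relation(docs, entity, relation):
    """Map each distinct relation argument to the ids of the documents stating it."""
    groups = defaultdict(list)
    for doc in docs:
        for ent, rel, arg in doc["relations"]:
            if ent == entity and rel == relation:
                groups[arg].append(doc["id"])
    return dict(groups)


def diversify(docs, entity, relation):
    """Keep one representative document per distinct argument, in first-seen order."""
    groups = group_by_relation(docs, entity, relation)
    return [ids[0] for ids in groups.values()]


diverse_ids = diversify(docs, "Michael Jordan", "is_employed_at")
```

In this toy setting the rare sense (the accounting employee) survives in the
diversified list even though it occurs in a single document, which is exactly
the case where frequency-based evidence alone would be too weak.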
Another interesting dimension of document diversity is the opinion expressed in
text. Documents can be 99% similar according to a scalar product based on
weighting schemes (especially if traditional stoplists are applied) but express
a completely different viewpoint. This is mainly due to the fact that documents
reporting different opinions on some events describe them by changing mainly
adjectives, adverbs and syntactic constructions. Typical opinion polarity
classifiers can help to separate diverse retrieved documents but, when several
events are described, opinion analysis at the document level is ineffective. In
contrast, extracting topics, opinion holders and opinion expressions would make
it possible to retrieve documents that are diverse with respect to events and
the opinions on them. In this perspective, one main goal of the LivingKnowledge
project² is to reveal and analyze the diversity of the information in the Web,
as well as the potential bias in the related sources.
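
The gap between lexical similarity and opinion can be made concrete with a
small Python sketch. It is only an illustration under stated assumptions: the
two sentences, the toy stoplist and the two-word polarity lexicon are invented
for the example, and a real system would rely on trained opinion extraction
models rather than a lookup table.

```python
import math
from collections import Counter

STOPLIST = {"the", "on", "and"}  # toy stoplist, an assumption for this example

# Hypothetical polarity lexicon; it only covers the two opinion words used below.
POLARITY = {"wise": 1, "reckless": -1}


def bag_of_words(text):
    """Raw term counts after lowercasing, whitespace tokenization and stoplist removal."""
    return Counter(tok for tok in text.lower().split() if tok not in STOPLIST)


def cosine(a, b):
    """Cosine similarity (normalized scalar product) between two count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def polarity(text):
    """Sum of lexicon scores over the tokens of the text."""
    return sum(POLARITY.get(tok, 0) for tok in text.lower().split())


doc_a = "the council approved the new transport plan on monday and the mayor called the decision wise"
doc_b = "the council approved the new transport plan on monday and the mayor called the decision reckless"

similarity = cosine(bag_of_words(doc_a), bag_of_words(doc_b))
```

Here the two texts share nine of their ten content words, so their cosine
similarity is 0.9, while polarity(doc_a) and polarity(doc_b) have opposite
signs. Longer documents that differ only in a few opinion words would be even
closer under the scalar product.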
In the remainder of this paper, Section 2 reports on the latest results in
sentence-level Relation Extraction, Section 3 describes our approach to opinion
mining in LivingKnowledge and, finally, Section 4 draws the conclusions.
2 Sentence-Level Relation Extraction
The extraction of relational data, e.g. relational facts, or world knowledge
from text, e.g. from the Web [26], has drawn its popularity from its potential
applications in a broad range of tasks. Relation Extraction (RE) is defined in
ACE as the task of finding relevant semantic relations between pairs of
entities in texts. Figure 1 shows part of a document from the ACE 2004 corpus,
a collection of news articles.
In the text, the relation between president and NBC's entertainment division
describes the relationship between the first entity (person) and the second
(organization), where the person holds a managerial position.
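
The task definition above, relations between pairs of entities in a sentence,
can be sketched as a small data structure in Python. The class names, the
candidate_pairs helper and the managerial_position label are hypothetical and
chosen only for illustration; they are not the ACE annotation scheme, and the
label is assigned by hand here, where a supervised system would predict it for
each candidate pair.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class EntityMention:
    text: str         # surface string of the mention
    entity_type: str  # coarse type, e.g. "person" or "organization"


@dataclass(frozen=True)
class RelationMention:
    arg1: EntityMention
    arg2: EntityMention
    label: str        # hypothetical relation label


def candidate_pairs(mentions):
    """Enumerate every unordered pair of entity mentions found in one sentence."""
    return list(combinations(mentions, 2))


# The two entities from the example in the text.
president = EntityMention("president", "person")
division = EntityMention("NBC's entertainment division", "organization")

pairs = candidate_pairs([president, division])
relation = RelationMention(*pairs[0], label="managerial_position")
```

With n entity mentions in a sentence this enumeration yields n(n-1)/2
candidates, most of which hold no relation, so a classifier over such pairs
also has to decide that no relation is present.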
To identify such semantic relations using machine learning, three settings
have been applied, namely supervised methods, e.g. [27, 7, 12, 30], semi-supervised
methods, e.g. [4, 1], and unsupervised methods, e.g. [9, 3]. Work on supervised
¹ At a higher level than simple lexical co-occurrences.
² http://livingknowledge-project.eu/