A Probabilistic Framework for Information Modelling and Retrieval Based on User Annotations on Digital Objects

by Ingo Frommholz
ACM SIGIR Forum (2008)
  • ISSN: 01635840

Abstract

Annotations are a means to make critical remarks, to explain and comment on things, to add notes and give opinions, and to relate objects. Nowadays, they can be found in digital libraries and collaboratories, for example as a building block for scientific discussion on the one hand or as private notes on the other. We further find them in product reviews, scientific databases and many "Web 2.0" applications; even well-established concepts like emails can be regarded as annotations in a certain sense. Digital annotations can be (textual) comments, markings (i.e. highlighted parts) and references to other documents or document parts. Since annotations convey information which is potentially important to satisfy a user's information need, this thesis tries to answer the question of how to exploit annotations for information retrieval. It gives a first answer to the question of whether retrieval effectiveness can be improved with annotations. A survey of the "annotation universe" reveals some facets of annotations; for example, they can be content level annotations (extending the content of the annotation object) or meta level ones (saying something about the annotated object). Besides the annotations themselves, other objects created during the process of annotation can be interesting for retrieval, these being the annotated fragments. These objects are integrated into an object-oriented model comprising digital objects such as structured documents and annotations as well as fragments. In this model, the different relationships among the various objects are reflected. From this model, the basic data structure for annotation-based retrieval, the structured annotation hypertext, is derived. In order to thoroughly exploit the information contained in structured annotation hypertexts, a probabilistic, object-oriented logical framework called POLAR is introduced.
In POLAR, structured annotation hypertexts can be modelled by means of probabilistic propositions and four-valued logics. POLAR allows for specifying several relationships among annotations and annotated (sub)parts or fragments. Queries can be posed to extract the knowledge contained in structured annotation hypertexts. POLAR supports annotation-based retrieval, i.e. document and discussion search, by applying an augmentation strategy (knowledge augmentation, propagating propositions from subcontexts like annotations, or relevance augmentation, where retrieval status values are propagated) in conjunction with probabilistic inference, where P(d → q), the probability that a document d implies a query q, is estimated. POLAR's semantics is based on possible worlds and accessibility relations. It is implemented on top of four-valued probabilistic Datalog. POLAR's core retrieval functionality, knowledge augmentation with probabilistic inference, is evaluated for discussion and document search. The experiments show that all relevant POLAR objects, merged annotation targets, fragments and content annotations, are able to increase retrieval effectiveness when used as a context for discussion or document search. Additional experiments reveal that we can determine the polarity of annotations with an accuracy of around 80%.

Available from www.is.inf.uni-due.de
A Probabilistic Framework for Information
Modelling and Retrieval Based on User Annotations
on Digital Objects
Dissertation
submitted to the Faculty of Engineering
of the University of Duisburg-Essen
in fulfilment of the requirements for the academic degree of
Doctor of Natural Sciences
by
Diplom-Informatiker
Ingo Peter August Frommholz
from Bochum-Wattenscheid
Date of submission:
21 May 2008
Acknowledgements
I would like to take this opportunity to thank those who accompanied me on the long way
throughout the time this thesis was created, who supported me in several ways and who showed
interest in my work.
I thank my former and current colleagues at Fraunhofer IPSI in Darmstadt and the Information
Systems group at the University of Duisburg-Essen, especially Holger Brocks, André
Everts, Marcello L’Abbate, Adelheit Stein, Matthias Hemmje, Sascha Kriewel and Claus-Peter
Klas. They always found the time for discussion, to give technical support or just to listen.
Special thanks go to Henrik Nottelmann, a brilliant nice guy who left us much too early, to
Erich Neuhold, who was involved in my work when he was institute director at IPSI, and Piklu
Gupta, a native English and a near-native German speaker (and, besides, a nice guy), who
helped translate even complicated German sentences into English. I also thank Marc Lechtenfeld
for his fantastic master's thesis on machine-learning methods to determine the polarity
of annotations, Dennis Korbar, who helped me by providing the infrastructure to create the
ZDNet testbed, and Ray Larson for reading an early version of this thesis.
Ulrich Thiel was the one who mentored me during my time at Fraunhofer IPSI. He showed
me the “other side of IR”, namely the cognitive, more user-oriented one. Ulrich’s comments
sometimes gave me a very hard time, but made me learn a lot.
Thomas Rölleke made the heart of this work possible by providing his superb HySpirit
framework. Without him, nothing of the proposed framework could actually be executed.
Thanks for good advice, a nice afternoon on a sailing boat and your patience for answering
many questions. And of course thanks for POOL.
During a visit to Padua, I had the opportunity for good and fruitful discussions with Maristella
Agosti and Nicola Ferro. Their collaboration enriched my work significantly. I’d like to
thank them for good advice, the nice time I had with them, for the good collaboration in
DELOS and for their interest in my work.
Especially I’d like to thank Norbert Fuhr. He is the person mainly involved in my work. His
inspiration, his deep knowledge and his support paved the way to make this thesis possible. He
was also the one giving me the opportunity to continue the work started in Darmstadt when I
began working at his chair in Duisburg.
Finally, very hearty thanks go to my family and especially my wife Damaris. She is the
one who suffered most while I was writing up this thesis, and her infinite patience cannot be
measured.
Thank you.
Ingo Frommholz
Darmstadt/Duisburg, May 2008
Contents
1 Introduction
I The Annotation Universe
2 The Annotation Universe – Applications, Facets and Properties
  2.1 Digital Annotations
    2.1.1 Definition and Usage
    2.1.2 Annotations in Digital Libraries and Collaboratories
    2.1.3 Annotations on the Web
    2.1.4 Email Discussions and Usenet News
    2.1.5 Semantic Annotation
    2.1.6 Scientific Databases
    2.1.7 Linguistic Annotation
  2.2 Facets of Annotations
    2.2.1 Annotations as Metadata
    2.2.2 Annotations as Content
    2.2.3 Annotations as Dialogue Acts
    2.2.4 Annotations as References
    2.2.5 Polarity of Annotations
    2.2.6 Annotations and Hypertexts
  2.3 Summary and Discussion
3 A Model of the Annotation Universe for Annotation-based IR
  3.1 Main Classes
    3.1.1 Digital Objects
    3.1.2 Structured Documents
    3.1.3 Annotatable Objects and Annotations
    3.1.4 Fragments
    3.1.5 Annotation Types
    3.1.6 Scope, Permission and Polarity
    3.1.7 Multiclassification
  3.2 Structured Annotation Hypertext
    3.2.1 Annotation Hypertext
    3.2.2 Structured Annotation Hypertext
  3.3 Summary and Discussion
II The POLAR Framework
4 Annotation-based Knowledge Modelling and Retrieval with POLAR
  4.1 Information Retrieval
    4.1.1 Introduction
    4.1.2 An Overview of Retrieval Models
    4.1.3 Hypertext, Structured Document and Web Retrieval
    4.1.4 Annotation-based Retrieval
  4.2 The POLAR Framework
    4.2.1 Motivation
    4.2.2 Probabilistic Object-oriented Logics for Annotation-based Retrieval
    4.2.3 Document and Query Representation and Description
    4.2.4 POLAR Knowledge Modelling
    4.2.5 Querying and Retrieval in POLAR
    4.2.6 Knowledge and Relevance Augmentation
  4.3 Further Application Showcases
    4.3.1 Annotation-based Structured Document Retrieval and Discussion Search
    4.3.2 Enriching a Document Ranking with Annotations
    4.3.3 Document Access through Fragments and Highlighted Parts
    4.3.4 Users and Groups
    4.3.5 Semantic Annotations and Ontologies
    4.3.6 Social Networks
    4.3.7 Ratings
    4.3.8 Annotation-based Trustworthiness
    4.3.9 Access Probability
  4.4 Related Work
    4.4.1 Hypertext and Structured Document IR and Discussion Search
    4.4.2 Annotation-based IR
  4.5 Summary and Discussion
5 POLAR Syntax and Semantics
  5.1 Syntax
    5.1.1 Basic Expressions
    5.1.2 Rules and Queries
  5.2 Semantics
    5.2.1 Possible Worlds
    5.2.2 Basic Knowledge Modelling
    5.2.3 Knowledge Augmentation
    5.2.4 Queries and Rules
  5.3 Retrieval Function
    5.3.1 Information Retrieval with Probabilistic Inference
    5.3.2 Probabilistic Inference in POLAR
  5.4 Summary and Discussion
6 POLAR Implementation
  6.1 Four-Valued Probabilistic Datalog (FVPD)
    6.1.1 Syntax of FVPD
    6.1.2 Translation to and Evaluation with Probabilistic Datalog
  6.2 POLAR Translation to FVPD
    6.2.1 Basic Knowledge Modelling
    6.2.2 Queries and Rules
    6.2.3 Knowledge Augmentation
    6.2.4 Retrieval Function
    6.2.5 Relevance Augmentation
  6.3 System Architecture and Java Implementation
    6.3.1 POLAR Translation and Execution
    6.3.2 POLAR Indexing
    6.3.3 POLAR Prototype
  6.4 Summary and Discussion
III Evaluation
7 Example Applications and Test Collections
  7.1 Emails as Annotations: The W3C Discussion Lists
    7.1.1 Collection
    7.1.2 The Annotation View on Email Messages
    7.1.3 Collection Statistics
    7.1.4 Representation in POLAR
  7.2 ZDNet News
    7.2.1 Collection Statistics
    7.2.2 Testbed Creation
    7.2.3 Representation in POLAR
    7.2.4 Polarity of Comments
  7.3 Summary and Discussion
8 Experiments
  8.1 Methodology and Presentation
    8.1.1 Evaluation Measures
    8.1.2 Significance Tests
    8.1.3 Presentation
  8.2 Term Weighting and Retrieval Functions
  8.3 Discussion Search
    8.3.1 Description of Runs
    8.3.2 Baseline and Whole Email Results
    8.3.3 Results for Knowledge Augmentation
  8.4 Document Search
    8.4.1 Description of Runs
    8.4.2 Results
  8.5 Determining the Polarity of an Annotation
    8.5.1 Machine Learning for Sentiment Classification in Discussions
  8.6 Summary and Discussion
9 Conclusion and Outlook
A Model of the Annotation Universe
1 Introduction
whole discussion thread was started with lots of agreement and disagreement. You read the
annotations in the discussion thread and gain valuable insights and new references for your own
article.
Now consider another case. You want to write a master’s thesis about digital libraries, and
to prepare for this thesis, you need a good and easy to understand introduction to the field.
You go to the Web site of a book seller which offers some functionality to search within the
book titles. You type “digital libraries introduction” as a query and find a few books which are
supposed to be introductions. Since you only want to read one of the books, you need to find
out which one of them is really introductory and easy to understand. Fortunately, the system
also presents user reviews, and for one book, you find two important comments:
This book is a good introduction!
It was easy to read and helped me preparing my exams.
Although this book was not on the first position of the ranking produced for your search, you
choose it due to these comments.
A fourth and last scenario. You are writing a PhD thesis and you want to find a certain
paper, but you forgot the title and do not remember the author. All you remember is this
passage in the paper which you found very interesting. You remember marking it with a green
marker and writing a comment on the margin: “Good idea, should include it in my model!”.
Looking at the pile of papers you read throughout the last year, you would give everything for
a hint where to find this annotated passage again! So you make your way through the pile, and
finally, after having examined dozens of papers for this single annotated fragment, you find it,
re-read the important parts of the paper, and integrate its basic idea into your model.
All scenarios, which are fictitious but could happen any day, have something in common:
user annotations are exploited to satisfy an information need. In the first scenario, main
documents like the evaluation report could not answer the specific question of whether the
navigation software runs on the desired smart phone; later, the answer could be found in the
comments. In this scenario, we hinted at an imaginary search engine which “knows” that the
desired information is in the annotation-based discussion, and could therefore easily point the
next user with the same problem to it. In the second scenario, which comes from the context of
the COLLATE project presented later, a document is associated with the information need
(“political censorship”) by way of annotations, and can therefore be deemed relevant.
Furthermore, since the main documents the system deals with are scans, and search engines usually
operate on textual descriptions of their material, the annotations may be one of the few sources
a search engine can exploit to present relevant documents to the user. The third scenario shows
the usefulness of annotations that say something about a document. From all the books offered
by the system, the one chosen was the one that seems to best fit the additional requirement of
being a good and readable introduction. Finally, in the last scenario, a document was found
(again) due to an annotated fragment.
These introductory scenarios motivate the main focus of this thesis: to use annotations as
an additional source for information retrieval to satisfy certain information needs. The examples
above are only a few of the possible ones involving annotations. In fact, the act of annotation is
many centuries old and has a long tradition and many applications in the non-digital world. But
in recent years, annotations have increasingly entered the digital world. We find them in commercial
office tools like Word or OpenOffice (where people can insert comments into the text or highlight
important parts), in digital libraries which let users annotate (and through this, interpret) the
material at hand, and in the form of product reviews and discussion forums attached to some
Part I
The Annotation Universe
2 The Annotation Universe – Applications, Facets and Properties
into a new document. When creating new annotations, users become active content providers
instead of merely being passive readers. Interpretations might help in understanding the content
of a document. They are also an important means to reconstruct the original context of a document.
Annotations may contain reviews and additional information about a document. As we
elaborate later, annotations are an important kind of metadata attached to a document. On
the collection level, annotations can be employed to link information resources, creating new
explicit relationships between them (Neuhold et al., 2004; Agosti et al., 2007a). Furthermore,
annotations can support access and retrieval of the information sources managed in a digital
library repository – the information contained in annotations may be important to judge the
relevance of a document w.r.t. a query, as annotations are a special kind of document context
(Frommholz et al., 2004a). How to model this context and employ it for information retrieval
is the focus of Part II of this thesis.
Closely related to digital libraries are collaboratories. A collaboratory, as formulated by
William Wulf, is defined as a
...center without walls, in which nation’s researchers can perform their research
without regard to geographical location – interacting with colleagues, accessing
instrumentation, sharing data and computation resource, and accessing information
in digital libraries. (Kouzes et al., 1996)
Collaboratories focus on facilitating scientific interaction and collaboration within a team. Besides
this, they should support the sharing of data and resources. Annotations can support
collaboratories in the above tasks by providing means to share annotations as well as annotation-
based discussion for the collaborative interpretation of the given material.
Consequently, the DELOS Network of Excellence on Digital Libraries sees annotation as a
new form of communication and identifies the understanding and managing of this new medium
as one of the major challenges in DL research (Del Bimbo et al., 2004). Several studies have
been performed to assist the design of digital library systems supporting annotations on the
user level (Marshall, 1997, 1998), on the conceptual level (Agosti and Ferro, 2003; Agosti et al.,
2004, 2007a) as well as on the system level (Agosti et al., 2005a, 2006). Current annotation
research also deals with the question of how to anchor annotations to the passage they belong
to (Bernheim Brush, 2002).
2.1.2.2 An example: Annotation-based scientific discussion in COLLATE
We present COLLATE as an example of a collaboratory for the humanities which enables
scientific discussion through annotations. The COLLATE [3] collaboratory (Thiel et al., 2004)
focuses on historic film documentation, dealing with documents about films of the 1920s and
1930s. Such documents can be, for example, censorship decisions, newspaper articles,
etc. They are digitised and stored in the system repository. COLLATE supports the work
of film scientists in different locations by establishing a collaboration cycle (Frommholz
et al., 2003): users can react to other users’ contributions, and so the cycle continues. Users have
the option of manually assigning keywords to the digitised documents as well as cataloguing
them according to a pre-defined schema. One of the central concepts of COLLATE is to support
document interpretation by enabling scientific discussion about documents through annotation
threads comprised of shared annotations.
[3] Collaboratory for Annotation, Indexing and Retrieval of Digitized Historical Archive Material, http://www.collate.de/
Figure 2.1: An annotation thread in COLLATE
Annotation threads consist of the annotated document (or a part of it) as root and nested
annotations connected to the root. The links between the nodes of an annotation thread
(documents and textual annotations) are typed with so-called discourse structure relations.
In COLLATE, the following relations are defined: elaboration (giving additional information),
analogy (describing similarities), difference (describing contrasts), cause (stating a cause for
specific circumstances), background information (e.g., information about the background of
an author), interpretation (of statements), support argument and counterargument (support or
attack other arguments). Figure 2.1 shows an example of two discourse structure relations. The
incorporation of these relations is discussed in more detail in Brocks et al. (2002). Modelling
annotation threads this way gives us explicit information about the pragmatics of statements
(through link types).
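The thread structure described above can be sketched as a small tree whose edges carry a discourse structure relation. This is only an illustration, not COLLATE's actual data model: the class and function names (`Relation`, `Node`, `render`) and the example texts are assumptions made here; only the eight relation names come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Relation(Enum):
    """Discourse structure relations listed in the text for COLLATE."""
    ELABORATION = "elaboration"
    ANALOGY = "analogy"
    DIFFERENCE = "difference"
    CAUSE = "cause"
    BACKGROUND_INFORMATION = "background information"
    INTERPRETATION = "interpretation"
    SUPPORT_ARGUMENT = "support argument"
    COUNTERARGUMENT = "counterargument"


@dataclass
class Node:
    """A thread node: the annotated document (or part of it) or an annotation."""
    text: str
    relation: Optional[Relation] = None  # type of the link to the parent; None for the root
    replies: List["Node"] = field(default_factory=list)

    def annotate(self, text: str, relation: Relation) -> "Node":
        """Attach a nested annotation via a typed link and return it."""
        child = Node(text, relation)
        self.replies.append(child)
        return child


def render(node: Node, depth: int = 0) -> List[str]:
    """Flatten a thread into indented lines, one per node, in depth-first order."""
    label = node.relation.value if node.relation else "root"
    lines = ["  " * depth + f"[{label}] {node.text}"]
    for reply in node.replies:
        lines.extend(render(reply, depth + 1))
    return lines


# Hypothetical thread; the texts are invented for illustration only.
root = Node("Censorship decision (scanned page)")
first = root.annotate("The stamp suggests a second review.", Relation.INTERPRETATION)
first.annotate("The date on the stamp contradicts this.", Relation.COUNTERARGUMENT)
root.annotate("The director was already known to the board.", Relation.BACKGROUND_INFORMATION)

thread_lines = render(root)
print("\n".join(thread_lines))
```

Keeping the relation on the link rather than on the annotation text is what makes the pragmatics of a statement explicit: a retrieval component can inspect the link type without interpreting the comment itself.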
Figure 2.2 shows a screenshot of the COLLATE prototype. In the lower right corner we see
in the background a page of one of the typical digitised documents film scientists deal with in
COLLATE. Users can annotate this page, the whole document or a fragment of the page. To
annotate a fragment, the user can mark the fragment with a rectangle, as was done around
the stamp in the example page. Above the digitised page we see a typical annotation. On the
left hand side of the screenshot there is a window showing the annotation thread belonging to
the document. In front of it, we can see the comment dialogue box for entering a new comment.
The user can see the message she is replying to for reference. She can choose from one of the
annotation types above and type her message. On the right hand side of the dialogue box the
user can request further actions on the new annotation or on the old one, which is one of the
various mechanisms in COLLATE to foster collaboration.
2.1.2.3 Other Systems supporting Annotations
There are many other digital libraries and collaboratories which support several kinds of annotations.
DAFFODIL (Klas et al., 2004a) is targeted at the support of the digital library
life cycle proposed by Paepcke (1996). While initially focussing on strategic retrieval support,
improvements of DAFFODIL concentrate on interpreting the material at hand, sharing new
insights and creating new knowledge. To support these tasks, the user is provided with basic
annotation functionality like the creation of annotations, browsing of annotation threads and
display of particular annotations (Agosti et al., 2006). As an example for the various possible
2.2 Facets of Annotations
created (Brocks et al., 2002). On the other hand, directives and commissives can trigger
further collaborative acts on the meta level. Directives can be used to attempt to get some
other person to do something; an example would be when a user asks the author of a comment
if he could further elaborate on it. The author, in turn, can answer the request with a promise
to provide the needed information (and actually provide it later on). Certain communicative
acts can thus enable collaboration, and they can be realised as annotations.
2.2.4 Annotations as References
Annotations may be references which link together two different documents or parts of documents.
A classical example of such a reference annotation is an arrow drawn from one text
passage to another on the same page in order to relate them. It could also be a
text like “see also the paper I read last week”. So a reference annotation can be combined
with a textual sign (Agosti and Ferro, 2003), meaning that we not only have a link between
two objects, but also some text describing the type or even the semantics of the link. This
way, documents and objects in the whole repository can be related to each other. With reference
annotations, resources can be chained into paths, as is done in the Walden’s Path system
(Furuta et al., 1997).
2.2.5 Polarity of Annotations
Another facet of annotations is their polarity. Annotations might convey a positive or negative
sentiment towards the object they are annotating. As an example on the content level, a
counter argument relation type, as it is found in COLLATE, indicates a negative polarity
towards an argument in the annotated object, whereas a support argument would implicitly
convey a positive sentiment. On the meta level, reviews of and judgements about an object
might overall be positive or negative.
2.2.6 Annotations and Hypertexts
When annotations are references, they link together the objects contained in the repository,
thus establishing a huge hypertext, according to the definition of hypertext provided in Agosti
and Smeaton (1996). Marshall (1998) considers annotations as a natural way of enhancing
hypertexts by actively engaging users with existing content in a digital library (Marshall, 1997).
But annotations that are not references (like textual comments) are also part of this hypertext,
due to their strong relation to the annotated object. Tools like the Multivalent Browser (Phelps
and Wilensky, 1997) let users select a text fragment which is then to be annotated, and in
COLLATE it is likewise possible to choose a part of an image as a fragment for annotation. So we
might regard such an annotated fragment (or document part or passage) as a single node in this
hypertext. We obtain a web consisting of objects (documents or annotations) and their fragments
which are connected through different kinds of links. For instance, annotated fragments can be
related to the original object they are contained in with an “is fragment of” relation; nodes in
an annotation thread are connected via an “annotates” or “has annotation target” relation, while
annotations that are references introduce a specific “references” relation. We call the resulting
structure the annotation hypertext (elsewhere also referred to as document-annotation hypertext).
If we do not regard documents and annotations as atomic units, but also consider their internal
structure (like books made of chapters, chapters made of sections), each of these structural
elements can be a node in this hypertext in its own right, connected to its related structural
elements through an is part of link. We call such an extended hypertext which combines intra-
2.3 Summary and Discussion 21
functions. This is the main issue of the following chapter, which deals with the definition of
the structured annotation hypertext and a discussion of its properties.
30 3 A Model of the Annotation Universe for Annotation-based IR
3.1.5 Annotation Types
Some annotation systems offer a categorisation of annotations into several types. Furthermore,
according to the considerations in Section 2.2, annotations can contain additional content or
content on the meta level, and thus be categorised into meta level and content level annotations.
Explanations, for example, contain additional content and expand the content of the object
they refer to. On the other hand, highlight markings operate on the meta level; if a passage is
highlighted, the implicit assertion is “this part is important”, but there is no additional content.
Another example of meta level content are judgements, where people state their opinion about
a document. To distinguish between these kinds of annotation types, we create new classes
ContentLevelAnnotation ⊑ Annotation
and
MetaLevelAnnotation ⊑ Annotation
and categorise our example annotation types accordingly, for example:
Highlighting ⊑ MetaLevelAnnotation
Judgement ⊑ MetaLevelAnnotation
Explanation ⊑ ContentLevelAnnotation
3.1.6 Scope, Permission and Polarity
Annotations can be private, shared or public. In the first case, only the creator of an annotation
has the right to access it; in the second case, the annotation is visible to a whole group. In the
last case, everyone can see the annotation. We thus need to model the scope of an annotation,
expressed by the scope property. It reflects whether the annotation is public, private or shared,
so these are the possible values of this property. In order to ensure authorised access to
annotations, the scope goes together with permissions. Shared annotations might be seen by
several groups, and not necessarily only the group the author is a member of. Annotation-based
retrieval functions should actually return only those annotations which are accessible by the
current user, which can either be because the annotation is public, or it is shared and the user
belongs to a group the annotation is to be seen by, or it is private and the user is the author
of the annotation⁴. In any case, we need to know by which groups an annotation can be seen
in order to properly handle access to shared annotations. We therefore introduce the property
seenBy between Annotation and Group. An annotation might be seen by one or more groups
or no group at all (in case of private annotations). Public annotations imply that they can be
seen by any group. We do not further discuss mechanisms to prevent scope and permission
conflicts, as they are not the focus of this thesis; readers interested in these issues are referred
to Agosti and Ferro (2007).
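The access rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `Annotation`, its field names and the function `is_accessible` are hypothetical and not part of the model, and the actual access policy depends on the application.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record holding the properties discussed in the text."""
    author: str
    scope: str                              # "public", "shared" or "private"
    seen_by: FrozenSet[str] = frozenset()   # groups given by the seenBy property


def is_accessible(annotation: Annotation, user: str,
                  user_groups: FrozenSet[str]) -> bool:
    """Return True if the user may see the annotation.

    Public: everyone. Shared: the user belongs to at least one group the
    annotation is seen by. Private: only the author.
    """
    if annotation.scope == "public":
        return True
    if annotation.scope == "shared":
        return bool(annotation.seen_by & user_groups)
    if annotation.scope == "private":
        return annotation.author == user
    raise ValueError(f"unknown scope: {annotation.scope!r}")
```

A retrieval function would apply such a filter to its result list before presenting it to the current user.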
Another attribute of annotations is their polarity (if known). For example, annotation types
like “agreement” and “support argument” have a clear positive polarity, since they express a
positive sentiment about the annotated part. In contrast, “disagreement” and “counterargu-
ment” convey a negative sentiment about the annotated content. In these cases, the annotation
type determines the overall polarity of the annotation. In other cases, the polarity might not
be clearly derivable from the annotation type, so the polarity might be determined by the
annotation content (for example, a comment itself does not have a certain polarity, but there
⁴ An appropriate access policy depends on the actual application and shall not be discussed here.
annotation hypertext (or structured document-annotation hypertext), which is similar to the
one in Agosti and Ferro (2005), but with certain differences, as it deals with fragments and
structured documents. For our further considerations, C(o) means that an instance o belongs
to a class C, while R(a, b) means instance a has value b for the property R. Before we continue,
we need to prohibit that, for example, an annotation annotates or references itself or a fragment
is a fragment of itself, or that a subcomponent is part of itself:
Constraint 1 (Loopless): The property value of an instance cannot be the instance itself
(but there may be properties having (other) instances of the same type):
R(a, b) ⇒ a ≠ b
for each property R. □
We further must prohibit that annotations reference the same objects they annotate (Agosti
and Ferro, 2007):
Constraint 2 (Annotation targets and references): If an annotation a annotates an ob-
ject o and references an object o′, they must not be the same:
hasAnnotationTarget(a, o) ∧ references(a, o′) ⇒ o ≠ o′
for all instances a, o, o′. □
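Constraints 1 and 2 can be checked mechanically over a set of property assertions. The following is a minimal sketch, assuming that assertions R(a, b) are given as triples `(R, a, b)`; the function name `violations` and the message format are illustrative choices, not part of the model.

```python
from typing import Iterable, List, Tuple

# An assertion R(a, b) is encoded as the triple (R, a, b); this encoding is an assumption.
Assertion = Tuple[str, str, str]


def violations(assertions: Iterable[Assertion]) -> List[str]:
    """List violations of Constraint 1 (loopless) and Constraint 2."""
    facts = set(assertions)
    problems: List[str] = []

    # Constraint 1: R(a, b) implies a != b, for every property R.
    for prop, a, b in sorted(facts):
        if a == b:
            problems.append(f"loop: {prop}({a}, {b})")

    # Constraint 2: an annotation must not reference the object it annotates.
    targets = {(a, o) for prop, a, o in facts if prop == "hasAnnotationTarget"}
    refs = {(a, o) for prop, a, o in facts if prop == "references"}
    for a, o in sorted(targets & refs):
        problems.append(f"target is also referenced: {a} -> {o}")

    return problems
```

An empty result means that the given instances satisfy both constraints.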
We begin our further considerations with the definition of the annotation hypertext and a
discussion of some important properties. The annotation hypertext can be derived from the
instances of our model.
3.2.1 Annotation Hypertext
An annotation hypertext is composed of digital objects as nodes and specific relations among
them as edges.
Definition 1 (Annotation Hypertext):
An annotation hypertext (or document-annotation hypertext) is a labelled digraph H =
(V,E) with V as the set of vertices and o ∈ V iff DigitalObject(o). E ⊆ V × V is the
set of edges and l : E → Σ* is a labelling function over an alphabet Σ. The annotation
hypertext is derived from the instances of our model as follows:
• (n,m) ∈ E if hasAnnotationTarget(n,m);
• (n,m) ∈ E if references(n,m);
• (n,m) ∈ E if isFragmentOf(n,m).
Other properties are not considered in the annotation hypertext. For each e = (n,m) ∈ E
it is
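\[
l(e) =
\begin{cases}
\text{``hasAnnotationTarget''} & \text{iff } \mathit{hasAnnotationTarget}(n,m) \\
\text{``references''} & \text{iff } \mathit{references}(n,m) \\
\text{``isFragmentOf''} & \text{iff } \mathit{isFragmentOf}(n,m)
\end{cases}
\]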
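Definition 1 can be read as a simple construction procedure. The sketch below derives the vertex set and the labelling function from model instances, again assuming that property assertions are given as triples `(property, n, m)`; the function name and the dictionary representation of l are illustrative.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]

# Only these three properties contribute edges (Definition 1).
EDGE_PROPERTIES = ("hasAnnotationTarget", "references", "isFragmentOf")


def annotation_hypertext(
    digital_objects: Iterable[str],
    assertions: Iterable[Tuple[str, str, str]],
) -> Tuple[Set[str], Dict[Edge, str]]:
    """Build H = (V, E) with its labelling l, returned as (V, l); E is the key set of l."""
    vertices = set(digital_objects)
    labels: Dict[Edge, str] = {}
    for prop, n, m in assertions:
        if prop not in EDGE_PROPERTIES:
            continue  # other properties are not considered in the annotation hypertext
        if n not in vertices or m not in vertices:
            raise ValueError(f"{prop}({n}, {m}) refers to an unknown digital object")
        labels[(n, m)] = prop
    return vertices, labels
```

Representing l as a dictionary keyed by edges relies on Constraint 2, which ensures that an edge cannot carry both the “hasAnnotationTarget” and the “references” label.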
1. t(a) > t(y) for each i ∈ {x} ∪ SUBx with hasAnnotationTarget(a,i) or references(a,i);
each object annotating or referencing x or one of its subcomponents must be younger
than the youngest component;
2. t(f) > t(y) for each i ∈ {x} ∪ SUBx with isFragmentOf(f ,i); each fragment of x or its
subcomponents must be younger than the youngest component;
3. t(o) > t(u) if hasAnnotationTarget(x,u) or references(x,u); if x annotates or references
another annotatable object u, the oldest component must be younger than u. (Any
s ∈ SUBx can be an annotation target, but cannot be an annotation itself.)
Let u, v ∈ {x} ∪ SUBx. The above conditions prohibit that t(u) > t(i) > t(v) with i ∉
{x} ∪ SUBx annotating or referencing v or a fragment of v. In this case, u would have been
added to {x} ∪ SUBx after a component or subcomponent was annotated. u would be younger
than i, which would violate condition 1. Similarly, in case i is a fragment of v, we would have
condition 2 violated. Finally, if u annotates i or a fragment of i, then condition 3 is violated. □
We define the structured annotation hypertext as an extension of annotation hypertexts
dealing with structured documents and annotations. We can create a structured annotation
hypertext from annotation hypertexts by adding the structural relationships between compo-
nents which are given by the isPartOf property.
Definition 2 (Structured Annotation Hypertext):
A structured annotation hypertext SH = (V′, E′) is an extension of an annotation hypertext
H = (V, E) and created as follows:
• V′ = V, E′ ⊇ E
• If isPartOf(n, m) then (n, m) ∈ E′ and l((n, m)) = “isPartOf”.
Note that H is a subgraph of SH. V′ = V because all digital objects, even components, are
vertices in annotation hypertexts. An annotation hypertext is thus a structured annotation
hypertext without “isPartOf” edges.
Example 1 (Structured annotation hypertext): Figure 3.2 shows an example of a struc-
tured annotation hypertext. Here, document d1 is a structured document with subcomponents
s1 and s2. The annotation a1 annotates s2. f is a fragment of d1 which is annotated by a3.
a3 annotates both f and a1, and is a structured annotation containing t. a2 annotates s2 and
references it to d2, thus creating a link from s2 to d2. □
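The example can be written down as a small labelled edge set. The sketch below encodes only the relations stated in the prose of Example 1 (the figure itself is not reproduced here); the variable and function names are illustrative.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]

# All digital objects mentioned in Example 1.
VERTICES: Set[str] = {"d1", "d2", "s1", "s2", "f", "t", "a1", "a2", "a3"}

# Labelled edges of the structured annotation hypertext SH.
SH_EDGES: Dict[Edge, str] = {
    ("s1", "d1"): "isPartOf",            # s1 and s2 are subcomponents of d1
    ("s2", "d1"): "isPartOf",
    ("t", "a3"): "isPartOf",             # a3 is a structured annotation containing t
    ("f", "d1"): "isFragmentOf",         # f is a fragment of d1
    ("a1", "s2"): "hasAnnotationTarget",
    ("a2", "s2"): "hasAnnotationTarget",
    ("a2", "d2"): "references",
    ("a3", "f"): "hasAnnotationTarget",
    ("a3", "a1"): "hasAnnotationTarget",
}


def underlying_annotation_hypertext(sh_edges: Dict[Edge, str]) -> Dict[Edge, str]:
    """Drop the "isPartOf" edges to obtain the edges of H (Definition 2)."""
    return {edge: label for edge, label in sh_edges.items() if label != "isPartOf"}
```

Removing the three “isPartOf” edges leaves the six edges of the plain annotation hypertext H over the same vertex set, which reflects that H is a subgraph of SH with V′ = V.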
In (3.6) on page 28 we defined structured annotations and documents by means of components,
which are objects having exactly one isPartOf property. The problem is that a situation
like the one depicted in Figure 3.3 can still occur – s2 is part of s1, s3 is part of s2, but s1 is part of
s3, so we have a cycle here and a set of components none of which is a subpart of
a document or annotation. This is certainly not what we want, so we have to ensure that in
a set of connected components, there exists one component which is part of an annotation or
document. If this is the case, each subgraph consisting only of “isPartOf” relations is a tree.
3.2 Structured Annotation Hypertext 35
Figure 3.2: Example of a structured annotation hypertext
Constraint 5 (Structured objects are trees): Let
AN(x) = {y|component(y) ∧ (isPartOf(x,y) ∨ (isPartOf(x,y′) ∧ y ∈ AN(y′)))}
be the set of ancestor components of a component x. Either x or one of its ancestor components
must be part of a document or annotation:
∀x : component(x) ⟹ ∃y ∈ {x} ∪ AN(x) : isPartOf(y, y′) ∧ (Annotation(y′) ∨ Document(y′))
□
Applying this constraint forbids the structure in Fig. 3.3.
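A check for Constraint 5 can be sketched by following the isPartOf chain upwards from every component. The code assumes that each component has exactly one isPartOf value, stored in a dictionary `part_of`, and that documents and annotations are given as a set `roots`; both names and the function name are hypothetical.

```python
from typing import Dict, Set


def components_violating_constraint_5(part_of: Dict[str, str],
                                      roots: Set[str]) -> Set[str]:
    """Return the components whose isPartOf chain never reaches a document or annotation."""
    bad: Set[str] = set()
    for component in part_of:
        seen: Set[str] = set()
        current = component
        # Walk upwards until a root is reached, the chain ends, or a cycle is detected.
        while current not in roots and current in part_of and current not in seen:
            seen.add(current)
            current = part_of[current]
        if current not in roots:
            bad.add(component)
    return bad
```

For the cyclic structure described for Figure 3.3 (s2 part of s1, s3 part of s2, s1 part of s3), all three components are reported, whereas a chain that ends in a document yields an empty set.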
We also forbid that a component or subcomponent of an annotation can annotate or be
annotated by itself:
Constraint 6 (Self annotation and reference): Let a be a structured annotation and
SUBa be the set of its subcomponents which are directly or indirectly connected with a through
the isPartOf property. No element from SUBa can be an annotation, and Constraint 1 prohibits
that a annotates itself. Let SH = (V,E) be the structured annotation hypertext containing
a, so that a ∈ V . Then there must not exist any edge (a, n) ∈ E with n ∈ SUBa. We
also have to take into account the case that a references or annotates a fragment of its subcomponents.
So we say that there must not exist edges (a, n) ∈ E and (n, m) ∈ E with
l((n, m)) = “isFragmentOf” and m ∈ SUBa. Figure 3.4.a illustrates some forbidden cases
of self annotation. □
Note that Constraint 3 alone would not prohibit self annotation.
Figure 3.3: Cyclic component structure
3.3 Summary and Discussion 37
on existing ones addressing the objects in digital libraries (Gonçalves et al., 2004), annotations
of digital content (Agosti and Ferro, 2007) and structured document retrieval (e.g., (Fuhr et al.,
2002; Fuhr and Großjohann, 2004; Chiaramella et al., 1996)). We apply Description Logics to
describe our model; the T-Box shows the relations between certain classes like digital objects,
fragments, annotations and documents, whereas the A-Box specifies the individuals. On this
level we identified the structured annotation hypertext and showed that it is acyclic.
The model presented in this chapter can be used as the underlying data structure to design
future annotation-based retrieval methods, combining structured documents with annotation
hypertexts. It integrates, sometimes extends or weakens, many concepts, relationships and
constraints found in the different models mentioned above. The proposed model does not claim
to be yet another model of annotations and digital libraries besides the ones already mentioned,
but emphasises the objects which we deem useful and which should be considered for retrieval
involving annotations. How these objects are used for retrieval is the subject of the next part of
this thesis, and whether exploiting them really improves retrieval effectiveness is discussed in Part III.
It is understood that this model should not be seen as carved in stone; instead, it should be
regarded as an open model and may be tailored, if necessary, to a specific annotation-based
retrieval application.
We are now ready to introduce the main contribution of this thesis, which is a probabilis-
tic, object-oriented, logic-based framework for annotation-based retrieval. The framework, its
functionality, syntax, semantics and implementation are discussed in the next part of this thesis.
It operates on a structured annotation hypertext, like the one discussed in this chapter, as the
underlying data structure.
Part II
The POLAR Framework
42 4 Annotation-based Knowledge Modelling and Retrieval with POLAR
context of hypertext and structured document IR, discussion search and of course annotation-
based retrieval. The consecutive section concludes this chapter and discusses its main findings.
4.1 Information Retrieval
This section gives an overview of information retrieval. It commences with an introduction
to the field and then presents important retrieval models. Since in POLAR we deal with
hypertexts and structured documents, these topics are subsequently introduced. Finally, the
relatively young field of annotation-based IR is presented.
4.1.1 Introduction
According to Baeza-Yates and Ribeiro-Neto (1999), information retrieval (or IR for short)
“deals with the representation, storage, organization of, and access to information items. The
representation and organization of the information items should provide the user with easy
access to the information in which he is interested.” The main goal of IR is thus to let the
user easily access the information, stored in information items pooled into a collection, he or
she needs to satisfy an information need which arises when fulfilling a certain task. So we have
an information need on the one side, and a set of information items (usually called documents
in the IR context) containing relevant information on the other side. In fact, finding relevant
information for a given information need is a difficult task, as Mizzaro (1998) describes. The
goal is to deliver the user exactly the information which is relevant to the user’s real information
need at the given time, to the given topic, to fulfil the given task within a specific context.
Today’s IR systems and search engines can support this goal only to a certain degree. Problems
arise, for example, in describing the actual information need and in transforming it into a (keyword-based)
query language the IR system understands. On the other hand, many IR systems see
documents as atomic units and return them instead of the single piece of information which
is relevant. Users may wonder why a document was retrieved and need to find the relevant
information within the document. Instead of returning the information relevant to a real
information need, contemporary IR systems usually return documents which are relevant to a
query.
We can distinguish between two different views on the IR problem (Baeza-Yates and Ribeiro-
Neto, 1999, p. 7): the computer-centred view focusses on building efficient indexes and devel-
oping algorithms which process a query efficiently and effectively, i.e. they should return a
high-quality ranking of documents w.r.t. the query as quickly as possible. In contrast to that,
the human-centred view studies the users’ behaviour, tries to understand their information
needs and how retrieval systems can be operated to best satisfy these needs. The
computer- and human-centred views are not mutually exclusive; in fact, query processing as
studied in the computer-centred view is an important strategy and may be part of a bigger
solution to satisfy information needs. When talking of IR, usually the computer-centred view is
meant. The human-centred view is sometimes referred to as information seeking and searching.
In this thesis, we mainly focus on the computer-centred view of IR (and use the term “IR” for
this view), bearing in mind that the results of the work presented here might also be interesting
in the human-centred view. In fact, the approach presented later is aimed at providing
a flexible framework which can be operated to support sophisticated information needs.
4.1.2.2 Non-Probabilistic Models
Several retrieval models exist which can be described by means of the conceptual model (Baeza-
Yates and Ribeiro-Neto, 1999, chapter 2). In the Boolean model, documents are described as
term sets. If there are T terms in the collection, the document description is a T-dimensional
vector $\vec{d}$ with $d_i \in \{0, 1\}$, where $d_i = 1$ if $t_i$ appears in d, and $d_i = 0$ otherwise. Query descriptions are
Boolean expressions like “$t_1 \wedge (t_2 \vee \neg t_3)$”. The Boolean retrieval function ρ returns documents
which match the Boolean expression.
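As a small illustration of the Boolean model, the following sketch evaluates the example query above against a toy collection. Documents are represented as term sets and the query as a Python predicate; the document names and the function `boolean_retrieve` are made up for this example.

```python
from typing import Callable, Dict, List, Set


def boolean_retrieve(docs: Dict[str, Set[str]],
                     query: Callable[[Set[str]], bool]) -> List[str]:
    """Return the names of all documents whose term set satisfies the query."""
    return sorted(name for name, terms in docs.items() if query(terms))


# Toy collection (hypothetical); each document is described by its term set.
docs = {
    "d1": {"t1", "t2", "t3"},
    "d2": {"t1", "t3"},
    "d3": {"t1"},
    "d4": {"t2"},
}


def example_query(terms: Set[str]) -> bool:
    """The query t1 AND (t2 OR NOT t3) from the text."""
    return "t1" in terms and ("t2" in terms or "t3" not in terms)


print(boolean_retrieve(docs, example_query))  # prints ['d1', 'd3']
```

The result is an unranked set of matching documents, which is exactly the sharp partitioning that ranking models try to avoid.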
The vector space model (VSM) (Baeza-Yates and Ribeiro-Neto, 1999, section 2.5.3) is a
retrieval model which was first introduced by Salton when working on the SMART project.
It recognises that the use of binary weights and the sharp partitioning into matching and
non-matching documents is too limiting. Therefore, documents and queries in the VSM are
described as vectors of (negative or positive) term weights in a T-dimensional vector space.
The retrieval function ρ determines the similarity between the document and the query vector,
for instance by calculating the cosine of the angle between query and document vector, or the
scalar product, so that $\rho(d^D, q^D) = \vec{d} \cdot \vec{q}$ is the RSV of d w.r.t. q. The system returns a ranking
of documents according to decreasing RSVs.
To determine term weights, it is common in IR to use statistical values like the term frequency
$tf_{ij}$ of a term $t_i$ in a document $d_j$. $tf_{ij}$ is usually normalised by the document length (and
then denoted $ntf_{ij}$). A simple example is $ntf_{ij} = tf_{ij} / \max_l tf_{lj}$, where $\max_l tf_{lj}$ is the frequency
of the most frequent term in the document. Another important component is the inverse
document frequency $idf_i$, which is calculated from the number of documents $t_i$ appears in. For
example, $idf_i = \log(N/n_i)$ with N as the number of documents in the collection and $n_i$ as the
number of documents $t_i$ appears in. The motivation behind this factor is the assumption that
terms which appear in many documents are less discriminatory than those appearing in fewer
documents. Therefore, idf increases the rarer a term is. Many retrieval approaches based
on the VSM balance the two factors tf and idf, for example by using $w_{ij} = ntf_{ij} \cdot idf_i$ as the
weight for $t_i$ in $d_j$. The document $d_j$ is then described by the vector $\vec{d}_j = (w_{1j}, \ldots, w_{Tj})$. We
call such approaches tf × idf-based methods.
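The weighting scheme and the scalar-product retrieval function can be sketched as follows. The code uses exactly the example formulas given above (maximum-normalised tf and idf = log(N/n)); the function names, the sparse dictionary representation of vectors and the tie-breaking by document name are assumptions made for this illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Compute w_ij = ntf_ij * idf_i for every term of every (non-empty) document."""
    n_docs = len(docs)
    # n_i: number of documents in which term t_i appears
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors: Dict[str, Dict[str, float]] = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        max_tf = max(tf.values())  # frequency of the most frequent term in the document
        vectors[name] = {
            term: (freq / max_tf) * math.log(n_docs / doc_freq[term])
            for term, freq in tf.items()
        }
    return vectors


def rsv(doc_vector: Dict[str, float], query_vector: Dict[str, float]) -> float:
    """Scalar product of a sparse document vector and a sparse query vector."""
    return sum(w * query_vector.get(term, 0.0) for term, w in doc_vector.items())


def rank(docs: Dict[str, List[str]], query_vector: Dict[str, float]) -> List[str]:
    """Rank documents by decreasing RSV; ties are broken by document name."""
    vectors = tfidf_vectors(docs)
    scores = {name: rsv(vec, query_vector) for name, vec in vectors.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))
```

For instance, with the documents `["sailing", "boats", "sailing"]`, `["boats", "harbour"]` and `["greece", "harbour"]` and a query vector weighting “sailing” and “boats” with 1, the first document is ranked highest because “sailing” is both frequent in it and rare in the collection.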
4.1.2.3 Probabilistic Models
The aim of most probabilistic models (Crestani et al., 1998; Fuhr, 1992) is to rank documents
in decreasing order of P (R|q, d), the probability of relevance of a document d with respect
to the query q. The Probability Ranking Principle (PRP) gives a theoretical justification for
creating a ranking based on this probability (Robertson, 1977). Probabilistic relevance mod-
els thus estimate the probability P (R|q, d), often by applying Bayes’ Theorem and making
certain independence assumptions. We can roughly distinguish between model-oriented and
description-oriented approaches. The former are based on some probabilistic independence
assumptions.
Model-oriented approaches can be categorised into query-related and document-related learn-
ing (Fuhr, 1992). An example for a query-related approach is the binary independence retrieval
model (BIR) (Robertson and Sparck Jones, 1976) which utilises relevance feedback¹ data to re-weight
search terms of a given query q. In BIR, a document d is represented as a T-dimensional
term vector $\vec{x}$, so that P(R|q, d) becomes $P(R|q, \vec{x})$. In contrast to that, the binary independence
indexing (BII) approach (Fuhr and Buckley, 1991) is query independent but document
dependent. In the BII model, two probabilities have to be estimated: P(R|d), the probability
¹ Relevance feedback is the process of judging documents in a ranking as relevant or non-relevant.
4.1.3 Exploiting the Neighbourhood – Hypertext, Structured Document and Web
Retrieval
The retrieval and indexing models discussed so far regard documents as atomic units – no
relations between documents or within them are considered. But the emergence of hypertext
systems in the late 80s and of course the World Wide Web (WWW) made it possible to broaden
the view and see documents and document fragments embedded in a linked environment.
According to Agosti (1996), a hypertext is composed of nodes and a network of links. Nodes
may be document fragments, but also single documents. In the former case, links between
these nodes represent intra-document relations like the document structure; in the latter case,
links between nodes are inter-document relations, for example to similar documents. Within
the WWW we usually find a mixture of intra- and inter-document links. Links can be classified
in different link types, like structural links (reflecting the document structure), referential links
(e.g. between a document and a document that is citing it) and further unspecified associative
links (Agosti and Melucci, 2000). There can be numerous other link types in a system (Trigg,
1983, chapter 4). Nodes may represent different kinds of digital media (like videos, images,
sound); to reflect this, the term hypermedia is commonly used.
One of the main differences between hypertext and standard document collections is the
additional possibility to realise information search by navigation and browsing. Relevant doc-
uments are not only found by examining a linear ranking delivered by a typical search engine,
but also by following links pointing to potential other relevant nodes in the hypertext. To this
end, classical IR methods can aid navigation and browsing on the one hand, and searching
on the other hand. They can aid navigation and browsing by giving hints and links to next
relevant passages (Hammwöhner and Thiel, 1987) or automatically construct new hypertext
links of different type like similarity links (Agosti and Melucci, 2000). IR methods can further
help in finding entry points in the hypertext which are a good starting point for further navi-
gation. Neighbouring nodes and their corresponding (possibly typed) links (e.g. established by
citations) can be exploited in order to calculate the relevance of a node. This way, the hyper-
text defines the context a node/document is embedded in. The interested reader is referred to
Agosti and Smeaton (1996) for a thorough discussion of hypertext retrieval including references
to many approaches.
A very special case of hypertexts are structured documents. Nodes in structured documents
are document fragments which are connected by structural links. An example is a book which
is made of chapters, sections, subsections and so on. A chapter node, for instance, may point
to several section nodes, which again are linked to subsection nodes. Structured documents
basically form trees. The logical structure of documents is nowadays represented using the eX-
tensible Markup Language (XML). The success of XML as a means to exchange formatted data
(like database records) on the one hand, and for the representation of structured documents
(containing text, multimedia and metadata) on the other hand motivated the creation of new
models and methods for XML information retrieval (XML IR). Like many classical hypertext
approaches, XML IR methods exploit the structural context of nodes to allow for more pre-
cise access to the relevant content (by trying to find a best entry point within the document
structure). The fact that, in contrast to HTML, nodes in an XML tree usually convey defined
semantics, allows for the definition of complex retrieval tasks which are not only able to process
traditionally content-oriented queries, but can also handle hints about the type of elements to
be retrieved (so called content-and-structure queries). The success of XML also meant that test
collections for structured document IR became available. These led to the foundation of the
Initiative for the Evaluation of XML Retrieval (INEX), whose goal is not only the compara-
but just want to retrieve any relevant object in the repository. Since the desired objects are
embedded in a context given by their linked items, annotation-based IR is a kind of context-
based retrieval. Due to its underlying data structure, annotation-based IR is strongly related
to hypertext (or hypermedia) IR as well as to structured document retrieval or XML IR².
4.1.4.2 Relation to Hypertext and Structured Document IR
Just as the structured annotation hypertext is a special kind of hypertext, annotation-based IR
can be viewed as a special kind of hypertext IR in which we distinguish between different basic
objects, the (main) documents and the annotations. Known hypertext retrieval approaches
do not take the peculiarities of annotation hypertexts into account, so they are, although in
general applicable, usually too generic for the given problem.
Annotation-based IR can also be regarded as an extension to structured document retrieval
or XML IR, since it does not only consider the internal logical structure of documents, but
also external objects attached to them (the annotations). Classical XML IR methods would
cover only a subset of the possible retrieval tasks when dealing with structured documents and
annotations. We see from this discussion that annotation-based IR indeed establishes a new
class of retrieval tasks and problems to solve.
4.1.4.3 Possible Benefits
What do we hope to gain from annotation-based IR? One possible benefit is that it potentially
addresses the vocabulary problem. Usually, retrieval approaches assume that the document
author and the user share the same vocabulary, which is then adapted by the (automatic)
indexer. This assumption does not always hold – for instance, users and authors might use
different terms for the same facts or even paraphrase them (an expert in a specific field might
use a different vocabulary than the common user). The terms describing a fact may also change
over time (this is especially true with the interpretation of historic texts). With
annotations, a third player joins the game, the annotator. If we combine the evidence coming
from documents and annotations, chances are higher that the user uses a term for a fact which
either the document author or the annotator used. An example is the term ‘land’ in the
Bible, which generally, but not always, refers to the land of Israel or Canaan (Fraenkel and
Klein, 1999). So annotations can help in determining the relevance of passages where the term
‘land’ appears by associating them to Israel or Canaan, but also help to find the exceptions.
As Fraenkel and Klein explain, a query for “war in Babylon” would only be successful when
annotations are involved, since in a relevant passage³ ‘land’ exceptionally refers to ‘Babylon’,
which is only mentioned in an annotation coming from a Bible commentary. Another example
comes from the COLLATE scenario and can be seen in Figure 2.1 on page 13. Consider a
censorship document which says that a film was censored for, say, morality reasons. The first
annotator interprets the document in a way that she thinks the censorship reasons were actually
political ones. This way, the document is associated with political censorship, although from
its content there is no evidence for it. The document is probably relevant (together with
its attached annotation) when film students search for political censorship. In the case that
annotations express facts differently or associate things, they help in finding more documents
which would otherwise not be found, that is, they increase the recall. The danger, of course, is
² If we see XML IR detached from the framework it is named after as a variant of structured document retrieval,
that is. We do not want to imply that annotations should be encoded in XML.
³ “A sound of war is in the land, and of great destruction.” (Jeremiah 50:22)
4.2.1.2 Going Beyond POOL
POOL is a framework to model structured, complex objects. Its object-oriented design features
provide some helpful mechanisms like the possibility to describe complex objects through
propositions, classifications and attributes. An example POOL program (taken from Rölleke
(1998)) shall illustrate this. It models a complex, structured document d1 having the two
subparts s1 and s2. It also shows an interesting feature POOL provides, namely four truth
values with an open world assumption. Consider the following POOL object:
d1[ 0.9 s1[ 0.8/0.2 sailing ]
0.7 s2[ 0.6/0.4 sailing ]]
The document d1 consists of the two subparts s1 and s2; these are accessed with 0.9 and 0.7
probability, respectively. In the context of s1, the term “sailing” is true with 0.8 probability,
and it is true with a probability of 0.6 in the context of s2. In POOL, we can also specify the
probability that propositions are false (and even the probability that they are inconsistent); in
the example, “sailing” is false with 0.2 probability in s1 and with a probability of 0.4 in the
context of s2. A query for documents about “sailing”, expressed in POOL as
?- D[sailing]
would return a ranking containing s1 and s2, but also d1 due to an approach called knowledge
augmentation which propagates the weights from the subcontexts s1 and s2 to the augmented
context d1(s1,s2) for the document d1. Propositions in POOL can be not only terms, but also
classifications and attributes, making it a powerful tool to describe spatial relations as it would
be useful for multimedia documents.
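To give a rough intuition of how weights from subcontexts might be propagated to the enclosing context, consider the following sketch. It does not reproduce POOL's actual augmentation semantics, which are defined in Rölleke (1998) and also cover the false and inconsistent truth values. It only assumes, for illustration, that the subcontexts are independent and that a proposition is true in the augmented context if it is true in at least one accessed subcontext. The function name is hypothetical.

```python
from typing import Iterable, Tuple


def augmented_true_probability(subcontexts: Iterable[Tuple[float, float]]) -> float:
    """Combine (access probability, probability of being true) pairs.

    Illustrative assumption: subcontexts are independent, and the proposition
    holds in the augmented context if it holds in at least one accessed subcontext.
    """
    prob_not_true = 1.0
    for access_prob, true_prob in subcontexts:
        prob_not_true *= 1.0 - access_prob * true_prob
    return 1.0 - prob_not_true


# Values from the example object: s1 is accessed with 0.9 ("sailing" true with 0.8),
# s2 is accessed with 0.7 ("sailing" true with 0.6).
weight_d1 = augmented_true_probability([(0.9, 0.8), (0.7, 0.6)])
```

Under these simplifying assumptions, d1 receives a non-zero weight for “sailing” although the term is only asserted in its subparts, which is the effect the augmented context d1(s1,s2) is meant to achieve.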
Tree-like annotation threads might be modelled in POOL so that each annotation is a subcon-
text of the object it annotates. Supporting structured multimedia objects is a desirable feature
for annotation-based IR, as annotations and annotated objects can be multimedia documents
as well – one could think of voice comments or even video annotations. Another interesting
feature of POOL is its ability to deal with four-valued logics, which provides means to cope
with inconsistent and contradicting knowledge – indeed, also with annotations, knowledge can
get inconsistent and contradicting in case one annotator says that a proposition is true and an-
other one states that the same proposition is false. Inconsistencies and contradictions naturally
arise in annotation-based discussions. Additionally, the open world assumption supported by
POOL says that if there is no evidence that a proposition is true, we cannot infer that it is false
(like we would with a closed world assumption). This is an interesting feature for document
indexing, because the lack of an index term for a document does not mean that it must not be
indexed with that term. POOL can also create new intensional knowledge by means of rules,
and it is possible to pose sophisticated queries to the underlying knowledge base. Additionally,
with POOL it is possible to estimate the implication probability P (d→ q) as a retrieval status
value.
So why not just use POOL to model structured annotation hypertexts and perform
annotation-based IR? POOL is a powerful framework, and one of our goals is to reuse its
main ideas described above for our problem of annotation-based IR and for modelling struc-
tured annotation hypertexts. But as already outlined, annotation-based IR extends classical
structured document retrieval which POOL mainly aims at. While POOL can indirectly model
hypertexts and thus annotation hypertexts (by means of attributes and categories which can
describe links), we want to represent and support some of the special elements of structured an-
notation hypertexts directly and less cumbersome. POOL copes with tree structures, whereas
54 4 Annotation-based Knowledge Modelling and Retrieval with POLAR
subcontexts besides subparts). From an implementation point of view (which is an issue in
Chapter 6), POLAR is a kind of twin sister of POOL since both are based on four-valued
probabilistic Datalog. From a functional point of view, POLAR extends POOL in that POOL
programs can be evaluated by POLAR as well, but usually not the other way round.
4.2.2 Probabilistic Object-oriented Logics for Annotation-based Retrieval
POLAR has its roots in probability theory, object-oriented modelling and four-valued predicate
logics.
Probability theory is a well-defined framework for capturing uncertain knowledge we are
dealing with in IR. From this point of view, POLAR stands in the tradition of probabilistic
retrieval models on the one hand and can be used as a vehicle to implement new probabilistic
models on the other hand. Attributes, categorisations as well as index terms are assigned
probabilities; from these probabilistic facts, new knowledge can be derived according to the
rules of probability theory.
Object-oriented modelling is a well-known approach to create models of real world scenarios.
In POLAR, documents and annotations are (complex) objects, which makes it compatible with
the more general object-oriented view introduced in Chapter 3. On the global database level,
objects can have attributes and they can be classified. POLAR supports the aggregation of ob-
jects by means of knowledge augmentation. Augmented contexts represent objects aggregated
with their neighbouring context composed of annotations, referenced objects and the logical
document structure. POOL’s object-oriented features are preserved, but with an extended
notion of annotation-based aggregation.
Four-valued predicate logics are used to model the content of complex objects as well as their
relations to other objects. This content is represented as terms (propositions), classifications
(unary predicates) and attributes (binary predicates) as well as access to other objects. Four-
valued logics support the proper aggregation of objects by dealing with inconsistent knowledge
coming from different sources.
The aim of the probabilistic, object-oriented logic-based framework is to support annotation-
based retrieval. POLAR thus provides means to query structured annotation hypertexts on the
one hand and methods based on probabilistic inference on the other hand.
In POLAR, we further distinguish between the object and the global database context. In
the object context, classifications, attributes and terms allow for a sophisticated representation
of objects, combining the content and logical view on documents together with annotations.
Classifications and attributes represent factual knowledge about objects in the global database
context (e.g., is an object a document or an annotation, and what kind of annotation). It also
contains metadata about objects.
4.2.3 Document and Query Representation and Description
Recall the classes and properties of the object-oriented view on structured annotation hyper-
texts given in Figure 3.1 on page 24, and the conceptual retrieval model introduced in Sec-
tion 4.1.2.1. In our further considerations, a document d in the conceptual model is an instance
of AnnotatableObject in the object-oriented view. The transformation αD turns the annotat-
able object d into its document representation d. This representation has to be rich enough to
capture the context of annotatable objects. It should not only contain the body of the object,
for example as a bag of words in case of textual documents, but also its properties containing
metadata and links to other objects, depending on the actual subclass of AnnotatableObject d
document(d1)
comment(a1)
means document(d1) and comment(a1), respectively. Due to the above rules,
annotatableObject(d1), annotatableObject(a1) and annotation(a1) are intensional
knowledge derived implicitly.
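The derivation of such intensional facts can be mimicked with a small forward-chaining sketch in Python. The three rules below are assumptions reconstructed from the derived facts named above, and `derive` is a hypothetical helper for illustration only, not part of POLAR.

```python
# Unary rules of the form head(X) :- body(X), written as (head, body) pairs.
# ASSUMPTION: these three rules are inferred from the derived facts in the text.
RULES = [
    ("annotatableObject", "document"),
    ("annotatableObject", "annotation"),
    ("annotation", "comment"),
]


def derive(facts, rules=RULES):
    """Return the closure of a set of (predicate, object) facts under the rules."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            for predicate, obj in list(known):
                if predicate == body and (head, obj) not in known:
                    known.add((head, obj))
                    changed = True
    return known


# Extensional knowledge: document(d1) and comment(a1).
extensional = {("document", "d1"), ("comment", "a1")}
# Intensional knowledge: everything that only the rules make known.
intensional = derive(extensional) - extensional
print(sorted(intensional))
```

The loop repeats until no rule adds a new fact, so chained rules (comment, then annotation, then annotatableObject) are followed to the end.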
4.2.4.2 Metadata
Metadata can be expressed by means of (possibly probabilistic) attributes and categorisations.
For example,
d1.author(tim)
says that Tim is the author of d1.
The rules, categorisations and attributes presented above are all created in a certain context,
the so-called global database context. But in POLAR, each document and annotation describes
a context in its own right, as we are going to discuss now.
4.2.4.3 Documents and Annotations As Complex Objects
Complex Objects Documents and annotations in POLAR are complex objects and described
by contexts containing probabilistic propositions, which can be terms, classifications and at-
tributes. These propositions are derived from the document and annotation representation,
which in turn is extracted from the structured annotation hypertext. Textual content is the
source for term propositions in POLAR; their probability can be estimated using traditional
tf -based measures normalised to the range between 0 and 1. Depending on the document type,
there might also be multimedia content. Such content can be described with categorisations
and attributes in POLAR. Furthermore, documents and even annotations might be structured.
For example,
d1[ 0.5 information 0.6 retrieval
0.7 digital 0.3 libraries
s1[ 0.4 information 0.2 retrieval ]
m1[ o1[] o2[]
house(o1) tree(o2) o2.leftOf(o1) ] ]
states that the context d1 can be described by the term propositions ‘information’, ‘retrieval’,
‘digital’ and ‘libraries’ with the corresponding probabilities (as the outcome of a text indexing
process) and has a subpart (component) s1. s1 is a subcontext of d1. s1 can be indexed with
‘information’ and ‘retrieval’. Furthermore, a multimedia object m1 might be described by a
categorisation of its components and spatial properties, which can be expressed as attributes;
in this example, it says that m1 has two components o1 and o2 which are a house and a
tree, respectively, and o2 appears left of o1. With these mechanisms we are able to deal with
structured textual and multimedia documents and annotations.
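As an illustration of the tf-based estimation of term probabilities mentioned above, the following Python sketch normalises raw term counts by the highest count in the text, which is one common way to obtain values between 0 and 1. The choice of normalisation and the name `term_weights` are assumptions; the weights used in the POLAR examples of this chapter are not produced by this function.

```python
from collections import Counter


def term_weights(tokens):
    """Map each term to its frequency divided by the highest frequency in the text."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


# A made-up document body, already tokenised.
body = "digital libraries support retrieval in digital collections".split()
weights = term_weights(body)
```

The most frequent term receives the weight 1, every other term a proportionally smaller one; such values could then serve as probabilities of term propositions in a context.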
To extract the logical structure from the structured annotation hypertext and represent it in
POLAR, we utilise the isPartOf relation: d1[p s1[]] iff isPartOf(s1,d1). p is the probability
that we access s1 from d1, the so-called access probability. We come back to this probability
in a later discussion. The propositions made in the context of a document or annotation and
their weights are derived by indexing the body of the corresponding DigitalObject instance.
4.2 The POLAR Framework 59
Fragments When users create an annotation about a certain passage of a document, they
first select the corresponding document fragment. This fragment is also a part of a document,
and the fact that this was an annotation target should be expressed in our framework as well,
since this knowledge can be valuable in the retrieval process. For example,
d1[ 0.5 information 0.6 retrieval
0.7 digital 0.3 libraries
0.8 f1|| 0.9 digital 0.5 libraries 0.7 *a1|| ]
means that a fragment f1 of d1 which is about digital libraries was selected as an an-
notation target for a1; we refer to this fragment as an annotated part of d1. Formally,
d1[p1 f1|| p2 *a1||] iff isFragmentOf(f1,d1) and hasAnnotationTarget(a1,f1). p1 is the
probability that we access the fragment f1 from d1, and p2 is the probability that we access a1
from f1.
Fragments have a special property regarding annotations. We say that if an annotation
annotates a fragment, it also annotates the object the fragment belongs to. For example, if
a user selects a part of a paragraph and annotates this fragment, we regard the annotation
as belonging to both the fragment and the paragraph. Therefore, d[f|| p *a||] implies
d[ p *a]. We call this special property fragment permeability.
Merged Annotation Targets Annotation targets (i.e. the objects or fragments which are
annotated) may contain important information to determine the relevance of an annotation.
As a simple intuitive example, consider a fragment about digital libraries which is annotated
with a comment “This is an important new technology”. A reader of this annotation has to
refer to the annotation target to resolve the anaphora “this” and to learn that the annotation
talks about digital libraries. We see that the content of annotation targets is an important
context when searching for annotations, which is also confirmed later by the experiments in
Chapter 8.
Consider the following example:
a1[ 0.8 t1< 0.7 digital 0.8 libraries >
0.6 important 0.8 new 0.7 technology ]
t1 is the merged annotation target (or shortly merged target) of a1 and is about digital libraries.
More generally we say that an expression a1[p t1<. . .>] states that t1 is the merged target
of a1 and that this context is accessed with probability p. Merged annotation targets are
constructed as follows. Let $T_a = \{o \mid \mathrm{hasAnnotationTarget}(a, o)\}$ denote the set of annotation
targets of a. Instead of considering each annotation target on its own, we view the content an
annotation refers to in an integrated way and create a new virtual document t which contains
all propositions of each of a’s annotation targets. Therefore, $t = \bigcup_{o \in T_a} o$. In the indexing step,
probabilities are calculated for each proposition, e.g. based on the term frequency and length
of the newly created document t in case of terms. One reason to do so comes from our view of
emails as annotations, as discussed later in Section 7.1.2.
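The construction of a merged target can be sketched in Python as follows. We assume here that each annotation target is given as a list of tokens, that merging means concatenating these lists so that term frequencies remain available, and that the weight of a term is its count divided by the length of the merged document. All names and the two example targets are hypothetical.

```python
from collections import Counter


def merged_target(targets):
    """Concatenate the token lists of all annotation targets into one virtual document."""
    merged = []
    for tokens in targets:
        merged.extend(tokens)
    return merged


def relative_frequencies(tokens):
    """Weight each term by its count divided by the document length."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


# Two hypothetical annotation targets of one annotation a.
targets_of_a = [
    ["digital", "libraries"],
    ["digital", "archives"],
]
t = merged_target(targets_of_a)
weights = relative_frequencies(t)
```

Since the weights are computed on t as a whole, a term occurring in several targets gains weight, which would not be the case if each target were indexed on its own.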
4.2.4.5 Special Commands
POLAR supports a set of special commands, which are prefixed by a “_”. For example,
_echo("Print me")
prints the string “Print me” to the console. We will introduce additional special commands
when appropriate.
4.2.5.3 Content-oriented Queries
Content-oriented queries deal with uncertain knowledge and calculate a value for each object
w.r.t. the query and according to the probabilities of their propositions.
?- A[ search ]
returns all objects containing ‘search’.
?- D[ information & retrieval ]
returns all documents about “information AND retrieval”.
POLAR provides a special syntax to query for fragments explicitly. The query
?- F|| digital & libraries ||
returns annotated parts about digital libraries. This kind of query enables direct access to
annotated document fragments in case users are only interested in these parts.
4.2.5.4 Retrieval by Implication Probability
The content queries so far exploit the content of objects to fetch them if they fit to the given
POLAR query. If, for example, the probability of term propositions is based on the term
frequency (tf) within objects, we obtain a tf-like ranking. But information retrieval approaches
usually also employ the inverse document frequency (idf) to calculate an RSV. Furthermore, we
want to support retrieval based on the probability P(d → q) that a document d implies a query
q. For this we define a query⁶ q as another context. For example,
q1[ information 0.8 retrieval ]
defines a query q1 containing the terms ‘information’ and ‘retrieval’. ‘retrieval’ is weighted with
a probability of 0.8. The POLAR query
?- D->q1
returns all objects which imply this query, ranked by their decreasing implication probability.
How this value is actually calculated can differ. For example, Wong and Yao (1995) show
how probabilistic inference can be interpreted to realise well-known retrieval functions like the
vector space model. Rölleke (1998) presents how this interpretation can be applied to POOL.
In principle, it is possible to assimilate this solution for POLAR as well. In Section 6.2.4 we
will discuss further retrieval functions based on the implication probability, which are able to
produce a tf ×idf -like ranking. In the remainder of this chapter, in particular in Section 4.2.6.2,
we apply a very simple estimation for P (d→ q), which takes for each query term the product
of the term’s tf within d and q, and its idf value.
In order to allow for the integration of idf -like values, POLAR introduces term spaces. For
example,
0.5 ◦retrieval
says that the probability of the term ‘retrieval’ (which can be based on the inverse document
frequency, depending on the application) is 0.5.
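The simple estimation of P(d → q) described above can be sketched in Python. The text only states that the product of the term weights in d and q and the idf value is taken for each query term; adding these products up over all query terms is an assumption made here, as are the function name and the term-space value for ‘information’. The document weights are those of the d1 example given earlier.

```python
def implication_rsv(doc, query, term_space):
    """Sum, over the query terms, of document weight * query weight * term-space weight."""
    score = 0.0
    for term, query_weight in query.items():
        score += doc.get(term, 0.0) * query_weight * term_space.get(term, 0.0)
    return score


d1 = {"information": 0.5, "retrieval": 0.6, "digital": 0.7, "libraries": 0.3}
q1 = {"information": 1.0, "retrieval": 0.8}  # unweighted query terms are taken as 1.0
term_space = {"retrieval": 0.5, "information": 0.4}  # 0.4 is a made-up value

rsv = implication_rsv(d1, q1, term_space)  # 0.5 * 1.0 * 0.4 + 0.6 * 0.8 * 0.5
```

Terms missing from the document or from the term space contribute nothing to the score in this sketch.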
6 “Query” is meant in a retrieval sense here, not to be confused with a POLAR query prefixed by “?-”.
targets are considered as subcontexts. If we access a1 from d1 (or in general a subcontext
from its supercontext), we create an augmented context d1(a1). Augmented contexts aggregate
the information in all contained contexts. For example, knowledge augmentation propagates
all propositions in the subcontexts to the augmented context, according to their respective
access probabilities. In our example above, ‘retrieval’ is known in d1 with 0.6 probability,
while ‘search’ is completely unknown in this context. But if we access a1 from d1, which
we do with 0.8 probability, there is a further probability of 0.6 that the context a1 knows
about ‘search’. The probability that the augmented context d1(a1) knows about ‘search’ is
determined by the probability that d1 knows about search or we access a1 from d1 and a1 knows
about ‘search’. Therefore, d1(a1) knows about ‘search’ with a probability of 0.8 · 0.6 = 0.48.
Relevance augmentation, on the other hand, propagates retrieval status values. For both kinds
of augmentation, we have to care about some peculiarities of the different context types. For
example, meta annotations are ignored as subcontexts for augmentation, since otherwise we
would mix information on the content and the meta level.
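The calculation for ‘search’ can be written as a small Python function. This is a minimal sketch for a single subcontext and purely positive knowledge, assuming that the events involved are independent; the name `augment` is hypothetical.

```python
def augment(p_context, p_access, p_subcontext):
    """P(term true in context OR (subcontext accessed AND term true in subcontext))."""
    p_via_sub = p_access * p_subcontext
    return p_context + p_via_sub - p_context * p_via_sub


# 'search' is unknown in d1 (probability 0.0), a1 is accessed with 0.8
# probability, and a1 knows about 'search' with 0.6 probability.
search_in_d1_a1 = augment(0.0, 0.8, 0.6)
```

With a term that is unknown in the context to augment, the result reduces to the product of the access probability and the term probability in the subcontext, 0.8 · 0.6 = 0.48.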
Before we discuss knowledge and relevance augmentation any further, we define the notion
of augmented context expressions.
Definition 3 (Augmented context expression):
We call expressions like d1(a1), where we denote that d1 is the context to augment and
a1 is a (not further specified) subcontext, an augmented context expression. Augmented
context expressions can be nested, so that d1(a1(a2)) means that a1 is a subcontext of d1
and a2 a subcontext of a1. The subcontext relation is transitive, so a2 is a subcontext of
d1 as well. Here, a1 is the direct subcontext of d1. Contexts can have more than one direct
subcontext. In the augmented context expression, these are separated by commas. For
instance, d1(a1,a2(a3,a4)) means that both a1 and a2 are direct subcontexts of d1, and a3
and a4 are direct subcontexts of a2.
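To make the nesting of augmented context expressions concrete, the following Python sketch parses such an expression into a tree and lists all subcontexts. The parser and its function names are hypothetical illustrations and do not reflect the actual POLAR implementation.

```python
def parse_context(expression):
    """Parse e.g. 'd1(a1,a2(a3,a4))' into a (name, [children]) tuple."""
    text = expression.replace(" ", "")
    node, position = _parse_node(text, 0)
    if position != len(text):
        raise ValueError(f"unexpected input at position {position}")
    return node


def _parse_node(text, position):
    """Parse one context name and, if present, its parenthesised direct subcontexts."""
    start = position
    while position < len(text) and text[position] not in "(),":
        position += 1
    name = text[start:position]
    if not name:
        raise ValueError(f"context name expected at position {start}")
    children = []
    if position < len(text) and text[position] == "(":
        position += 1  # skip "("
        while True:
            child, position = _parse_node(text, position)
            children.append(child)
            if position < len(text) and text[position] == ",":
                position += 1  # skip "," and read the next direct subcontext
                continue
            break
        if position >= len(text) or text[position] != ")":
            raise ValueError(f"')' expected at position {position}")
        position += 1  # skip ")"
    return (name, children), position


def subcontexts(node):
    """All direct and indirect subcontext names (the relation is transitive)."""
    names = []
    for child in node[1]:
        names.append(child[0])
        names.extend(subcontexts(child))
    return names


tree = parse_context("d1(a1,a2(a3,a4))")
direct = [child[0] for child in tree[1]]  # direct subcontexts of d1
everything = subcontexts(tree)            # direct and indirect subcontexts of d1
```

The direct subcontexts are the children of the root node, while `subcontexts` follows the transitive relation through all nesting levels.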
We now present the two augmentation strategies, knowledge and relevance augmentation. We
start with knowledge augmentation and present examples in order to illustrate the approach.
Further discussions of both augmentation approaches can be found in the two subsequent
chapters.
4.2.6.2 Knowledge Augmentation
The following examples illustrate the effect of knowledge augmentation with respect to the
different POLAR subcontext types and show how results are computed.
Fragments Recall the previous example:
d1[ 0.5 information 0.6 retrieval
0.7 digital 0.3 libraries
0.8 f1|| 0.9 digital 0.5 libraries 0.7 *a1|| ]
d2[ 0.5 libraries ]
a1[]
A query for documents about ‘libraries’ without knowledge augmentation yields:
?- D[ libraries ]
0.5 (d2)
0.3 (d1)
due to the weight of ‘libraries’ in d1 and d2 which represents the probability that this term is
true in d1 and d2. Now we pose the same query, but with knowledge augmentation:
?- //D[ libraries ]
0.58 (d1) # from d1(a1,f1)
0.5 (d2) # from d2
The “//” tells the system to perform knowledge augmentation. What happened here? The
knowledge of d1 is augmented with the knowledge we find in its fragment f1. The fact that this
fragment has been annotated (otherwise the fragment would not exist) makes this fragment an
important part of d1, since the annotator spent some time to annotate it. Our claim is that
such annotated passages are implicitly highlighted (this is especially true for fragments which
are explicitly highlighted, e.g., by marking or underlining them). The more users annotate
a specific passage (implicitly or explicitly), the more we get an n-way-consensus (Marshall,
1998) that this passage has some value in it. The hypothesis is that we thus receive additional
evidence that the corresponding document should be indexed with the propositions (terms)
in the annotated part, resulting in higher probability of these propositions in d1(f1). So the
effect of knowledge augmentation with fragments is that no new terms are introduced, but the
weights of existing ones in the augmented context are raised, causing a different ranked result
in the example above. The new weight for ‘libraries’ represents the probability that the term is
true in the augmented context d1(a1,f1), which is the sum of the probabilities of the following
four cases:
• ‘libraries’ is true in d1 and we do not access/consider f1 (probability 0.3 ·(1−0.8) = 0.06);
• ‘libraries’ is true in d1 and we access f1 and the term is unknown in f1 (0.3·0.8·0.5 = 0.12);
• ‘libraries’ is true in d1 and we access f1 and the term is true in f1 (0.3 · 0.8 · 0.5 = 0.12);
• ‘libraries’ is unknown in d1 and we access f1 and the term is true in f1 (0.7·0.8·0.5 = 0.28).
The sum of the probabilities of these disjoint events is 0.06 + 0.12 + 0.12 + 0.28 = 0.58. It
represents the probability that an event occurs which makes ‘libraries’ true in d1(f1). This
probability can alternatively be calculated as
\[
P(\overbrace{\text{‘libraries’ true in d1}}^{=0.3} \;\text{OR}\; \overbrace{\text{f1 accessed from d1 and ‘libraries’ true in f1}}^{=0.8 \cdot 0.5 = 0.4}) = 0.3 + 0.4 - 0.3 \cdot 0.4 = 0.58
\]
with the inclusion-exclusion formula (see Definition 24 on page 133).
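Both ways of computing this probability can be checked with a short Python sketch. Instead of the four grouped cases, it enumerates all eight combinations of the three elementary events and adds up those in which ‘libraries’ becomes true in the augmented context; independence of the events is assumed, and all names are chosen for illustration only.

```python
from itertools import product

P_TRUE_D1 = 0.3    # 'libraries' is true in d1
P_ACCESS_F1 = 0.8  # f1 is accessed from d1
P_TRUE_F1 = 0.5    # 'libraries' is true in f1


def probability(event_holds, p):
    """Probability of an event with probability p holding or not holding."""
    return p if event_holds else 1.0 - p


# Sum over all elementary outcomes that make 'libraries' true in the augmented context.
by_enumeration = 0.0
for true_d1, access_f1, true_f1 in product([True, False], repeat=3):
    if true_d1 or (access_f1 and true_f1):
        by_enumeration += (
            probability(true_d1, P_TRUE_D1)
            * probability(access_f1, P_ACCESS_F1)
            * probability(true_f1, P_TRUE_F1)
        )

# The same value via the inclusion-exclusion formula.
p_via_f1 = P_ACCESS_F1 * P_TRUE_F1
by_inclusion_exclusion = P_TRUE_D1 + p_via_f1 - P_TRUE_D1 * p_via_f1
```

Both variables hold the value 0.58 up to floating-point rounding.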
Note that we use the augmented context d1(a1,f1) instead of d1(f1(a1)), which the structure
would suggest. There are two reasons for this: first, due to fragment permeability, if a1 annotates f1 and
f1 is a fragment of d1, a1 also annotates d1. Second, since we expand d1 with a1 directly, we
do not consider a1 any more when accessing it from f1.
Positive Content Annotations With fragments and merged targets, we considered objects
created during the annotation process for knowledge augmentation. A natural step further is
to take the content of annotations into account as well.
Consider the following POLAR program (imagine for example a document about ‘soccer’
and an annotation saying that ‘soccer’ is called ‘football’ in Europe):
document(d1)
annotation(a1)
d1 [ 0.6 soccer
0.7 *a1 ]
a1 [ 0.5 football ]
The query for documents about football
?- D[ football ] & document(D)
would not retrieve d1, although (for Europeans) it would be relevant. The query
?- //D[ football ] & document(D)
considers the term ‘football’ in a1 and would thus retrieve d1 with a probability of 0.7·0.5 = 0.35
due to the association of d1 with ‘football’ in d1(a1).
An interesting application of knowledge augmentation is the handling of contradictions, which
often occur in annotations and especially discussions. Consider the following situation:
d1[ *a1 *a2 ]
a1[ moon_made_of_cheese ]
a2[ !moon_made_of_cheese ]
Annotation a1 states that the moon is made of cheese, and a2 says it is not. A reader
of d1 would not get any information about what the moon is made of at all; but if
she considers the annotations as well, she would get inconsistent information about the
moon being made of cheese; neither the query “?- //D[moon_made_of_cheese]” nor
“?- //D[!moon_made_of_cheese]” would return d1, because ‘moon_made_of_cheese’
is inconsistent in d1(a1,a2)7.
We extend the example above with probabilities:
d1[ *a1 *a2 ]
a1[ 0.8 moon_made_of_cheese ]
a2[ 0/0.7 moon_made_of_cheese ]
The query
?- //D[ moon_made_of_cheese ]
returns
0.8 (a1) # from a1
0.24 (d1) # from d1(a1,a2)
We can see here how the negative probability (0.7) in a2 influences the probability that
‘moon_made_of_cheese’ is true in d1(a1,a2). This value is calculated as 0.8 · (1 − 0.7) =
0.24. The query “?- //D[!moon_made_of_cheese]” returns d1 with a probability of
(1− 0.8) · 0.7 = 0.14.
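The two values can be reproduced with the following Python sketch, which distributes the probability mass over the four truth values. It assumes that positive and negative evidence are independent and that both annotations are accessed with probability 1 (no access probability is given in the example). The masses for ‘inconsistent’ and ‘unknown’ simply follow from these assumptions and are not stated in the text; `four_valued` is a hypothetical name.

```python
def four_valued(p_true, p_false):
    """Masses of the four truth values for independent positive and negative evidence."""
    return {
        "true": p_true * (1.0 - p_false),             # positive and not negative
        "false": (1.0 - p_true) * p_false,            # negative and not positive
        "inconsistent": p_true * p_false,             # both kinds of evidence
        "unknown": (1.0 - p_true) * (1.0 - p_false),  # no evidence at all
    }


# a1 asserts the proposition with 0.8, a2 denies it with 0.7.
masses = four_valued(p_true=0.8, p_false=0.7)
```

Under these assumptions the largest share of the mass falls on ‘inconsistent’, which reflects that the two annotators most likely contradict each other.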
7 Note that POLAR does not offer means to query inconsistent knowledge yet. A possible extension of POLAR
might evaluate the query “?- //D[moon_made_of_cheese & !moon_made_of_cheese]” in a way that
it returns d1 in our example. This might be interesting for tasks where one wants to explicitly search for
topics which are discussed controversially.
Negative Content Annotations The examples presented so far all dealt with positive annota-
tions. The question is how we can handle negative annotations w.r.t. knowledge augmentation.
One option is that a proposition a appearing in a negative content annotation is propagated
as ¬a in the augmented context.
Consider the example of an annotation thread in COLLATE, which is shown in Figure 2.1
on page 13. We see the annotation a1, an interpretation saying that the film mentioned in the
source document was censored for political reasons. This annotation is annotated again; in
the reply a2, the annotator expresses her disagreement with the previous statement by using a
counterargument annotation type and saying that she thinks there were no political reasons (“I
disagree. There were no political reasons”).
We want to model this situation in POLAR and also reflect the fact that a2 is a neg-
ative response to a1, especially regarding the topic “political reasons”. One option is to
detect in a2’s content that the “no” belongs to “political reasons”. We might then assign
“!political_reasons” to a2. The other option (for instance in case of an automatic in-
dexer which is based on terms and does not detect that “no” belongs to “political reasons”
or even treats “no” as a stop word) is to use negative polarity and create a negative content
annotation, if we assume this can be inferred from the annotation type (counterargument in
this case). This scenario is expressed in POLAR as
document(d)
annotation(A) :- interpretation(A)
annotation(A) :- counterargument(A)
interpretation(a1)
counterargument(a2)
d[ *a1 ]
a1[ 0.7 film 0.5 censored 0.8 political_reasons 0.8 -*a2 ]
a2[ 0.7 political_reasons ]
The first line says that d is a document. The second and third lines mean that every interpretation
or counterargument is an annotation. Lines 4 and 5 classify a1 and a2 as interpretation
and counterargument, respectively, which also means they are annotations. Line 6 introduces
document d, with no further (textual) content. Line 7 shows the annotation a1 and its corre-
sponding terms and term weights. In the context of a1, a2 is a negative content annotation
and is accessed with 0.8 probability. The last line shows a2. The query
?- D[ political_reasons ]
returns, without any augmentation,
0.8 (a1)
0.7 (a2)
The document d would not be retrieved.
We have two annotations a1 and a2 which are both about ‘political reasons’, but a2 talks
negatively about a1 with respect to this term and is thus a negative content annotation in the
context of a1. We interpret this situation as a2 attacking the claim that a1 is a good authority for
‘political reasons’, which means that the corresponding term weight should be decreased when
considering the augmented context a1(a2). Knowledge augmentation adds the probability that
‘political reasons’ is true in a2 to the probability that it is negative in a1(a2); this value is then
also propagated to d(a1(a2)). So we get
?- //D[ political_reasons ]
0.7 (a2) # from a2
0.24 (a1) # from a1(a2)
0.24 (d) # from d(a1(a2))
While a1 has a positive effect on d(a1), we see that the existence of a2 has a negative effect
on a1(a2) and therefore also d(a1(a2)). Without a2, a1 and d would be assigned a value of
0.8 instead of 0.24 (= 0.8 · (1 − 0.7), the probability that ‘political reasons’ is positive and
not negative in a1(a2) and d(a1(a2)), respectively). If there was another negative content
annotation a3 which annotated a2 and also contains the term ‘political reasons’, then a2 would
have a negative effect on a1(a2(a3)) and d(a1(a2(a3))), but a3 would in turn have a positive
effect on a1(a2(a3)) and d(a1(a2(a3))) since it has a negative effect on a2.
Implication Probability and Knowledge Augmentation As mentioned before, POLAR sup-
ports context implication for retrieval, i.e. it estimates the probability P (d → q) that a
document implies a query. This can be combined with knowledge augmentation to realise
annotation-based retrieval.
Again, we calculate the implication probability as above in Section 4.2.5.4. Besides the
weights of propositions (terms, attributes and classifications) in contexts, this method to com-
pute the implication probability also integrates the inverse document frequency. Let us extend
the above COLLATE example with some idf -based term measures, for instance
0.5 ◦political_reasons
which says that the (idf -based) term probability of ‘political reasons’ is 0.5. Then,
q[ political_reasons ]
?- D->q
calculates a retrieval status value of 0.4 for a1 (the term weight (0.8) multiplied by the
idf (0.5)) and 0.35 for a2. d would not be found in that case. Now we apply knowledge
augmentation. We get
?- //D->q
0.35 (a2) # from a2
0.12 (a1) # from a1(a2)
0.12 (d) # from d(a1(a2))
As seen above, the weight for ‘political reasons’ in the augmented contexts of a1 and d is 0.24,
and 0.24 · 0.5 is the RSV of both a1 and d.
We introduced POLAR’s knowledge augmentation facilities. These take into account every
subcontext of the context to augment. Later in Section 4.3.1 we discuss further knowledge
augmentation examples and how to fine-tune the augmentation process according to subcontext
types.
4.2.6.3 Relevance Augmentation
Besides knowledge augmentation, POLAR supports another augmentation strategy which we
call relevance augmentation. In contrast to knowledge augmentation, we calculate the RSV of
Figure 4.3: Relevance augmentation example. Arrows denote propagation.
each single object first. Relevance augmentation then means that we propagate the retrieval
status value of a context to its supercontext to create the final RSV of the augmented con-
text (see the example below). The advantage of relevance augmentation is that it operates on
retrieval status values rather than propositions. This is important for example in cases when
an annotation service employing POLAR does not have access to document fulltexts, which
are possibly stored in external repositories, to extract the required propositions for knowledge
augmentation. Such a scenario is outlined in Agosti et al. (2006) where the DiLAS annota-
tion service is presented; DiLAS is supposed to be linked to several external digital library
management systems, and these systems might only provide document handles and a search
API, but no access to the fulltexts in order to index them. With relevance augmentation,
a POLAR implementation can query these external sources in order to fetch retrieval status
values for external documents, and fuse them with annotations’ retrieval status values for
annotation-based document search (as outlined similarly in Agosti and Ferro (2005))8.
Relevance augmentation is illustrated in Figure 4.3. Let us say that for a query q1 and a
document d1, an external digital library management system returns a retrieval status value
of 0.3. Furthermore, the RSVs for a1 and a2 are 0.5 and 0.2, respectively. Let us further
assume that a1 is a content annotation and a2 is a negative content annotation of d1. The
corresponding access probabilities are 0.5. The relevance augmentation approach now combines
these three retrieval weights and generates a new context-based one for d1. The weight of a1
raises the resulting weight for d1, while the weight of a2 lowers it again, since we have negative
evidence about the relevance of d1 here. From the document itself, we know that it is relevant
with 0.3 probability. From the annotations, we have both positive and negative evidence. The
positive evidence comes from a1; together with the positive evidence from the document, we
infer with a probability of 0.3 + 0.5 · 0.5− 0.3 · 0.5 · 0.5 = 0.475 that d1 is relevant. Considering
the negative evidence from the annotations, we infer that d1 is not relevant with 0.5 · 0.2 = 0.1
probability. Relevance augmentation combines positive and negative evidence and calculates
the probability that we have positive evidence and not negative evidence from the context, that
is 0.475 · (1− 0.1) = 0.4275, which is the final context-based retrieval status value of d1.
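The calculation above can be expressed as a Python function. The example in the text has exactly one positive and one negative content annotation; handling several annotations of each kind by combining them as independent events is an assumption of this sketch, and the function name is hypothetical.

```python
def relevance_augmentation(rsv_doc, positive, negative):
    """Combine a document RSV with (access probability, RSV) pairs of its annotations."""
    # Probability that neither the document nor any positive annotation
    # provides positive evidence.
    p_no_positive = 1.0 - rsv_doc
    for access, rsv in positive:
        p_no_positive *= 1.0 - access * rsv
    # Probability that no negative annotation provides negative evidence.
    p_no_negative = 1.0
    for access, rsv in negative:
        p_no_negative *= 1.0 - access * rsv
    # Positive evidence and not negative evidence.
    return (1.0 - p_no_positive) * p_no_negative


# d1 has an RSV of 0.3; a1 (positive, RSV 0.5) and a2 (negative, RSV 0.2)
# are both accessed with 0.5 probability.
rsv_d1 = relevance_augmentation(0.3, positive=[(0.5, 0.5)], negative=[(0.5, 0.2)])
```

For a single positive annotation, the first part equals the inclusion-exclusion calculation 0.3 + 0.5 · 0.5 − 0.3 · 0.5 · 0.5 = 0.475 used above.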
Syntactically, the expression “?- //D->q1” could be used for relevance augmentation9. See
Section 6.2.5 for a further discussion.
8 We have to be aware that the RSVs coming from external sources are not necessarily probabilities and often
need to be normalised accordingly.
9 Note that, despite the choice of the syntactic expression, relevance augmentation does not necessarily
calculate an implication probability (this also depends on the external sources and their underlying retrieval
function).
We introduced POLAR’s knowledge modelling, querying and retrieval capabilities and its
core concept, augmentation. Before discussing POLAR’s syntax and semantics formally in the
next chapter, we present some further examples of possible POLAR applications.
4.3 Further Application Showcases
The structured annotation hypertext introduced in the last chapter is a very complex data
structure containing many different components and their relations. As we have seen in Chap-
ter 2, annotations can be of many different types and can have many facets. This makes clear
that POLAR, as a framework to model structured annotation hypertexts, to query them and
to perform probabilistic retrieval on them, potentially serves a wide range of possible tasks
and applications. In the previous considerations, we have already seen some examples, when
we discussed knowledge augmentation, which is able to handle negative and inconsistent knowledge.
We are now going to present some further application showcases, also to give additional
examples of POLAR programs. These single examples can of course be combined in order to
fulfil more complex tasks. All showcases have in common that they combine different kinds of
evidence coming from the structured annotation hypertext in order to determine the relevance
of documents and annotations, respectively, for document or discussion search.
4.3.1 Annotation-based Structured Document Retrieval and Discussion Search
4.3.1.1 Outline
We have discussed POLAR’s main retrieval function based on the estimation of P (d → q),
which can be used in combination with knowledge and relevance augmentation. Augmentation
is a well-known principle for structured document retrieval, where we search for a best entry
point within a structured document. From this perspective, augmentation in POLAR allows
for annotation-based structured document retrieval. Annotations help finding best entry points
in documents. Consider the situation illustrated in Figure 4.4. When we perform classical
structured document retrieval, we are interested in documents and their subparts, which are in
this case d1, s1, s2, s11, s12 and s21. Propagation and augmentation consider these subcontexts
only. For instance, just s21 would influence the RSV of s2 (by propagating its knowledge to s2
when forming s2(s21)). If we add annotations to the retrieval process (by taking a1, ..., a7 into
account as well), s2’s RSV is also influenced by its direct annotations a3 and a6, and further
indirectly by a4 and a5. It is also indirectly influenced by a7, since the knowledge of a7 is
propagated to s21 and (with a lower resulting propagation factor) also to s2. a1, . . ., a7 are not
retrieved, but contain additional evidence for the relevance or non-relevance of the respective
subparts. Furthermore, in Fig. 4.4, a6 references d2. If we propagate the information coming
from referenced objects as well, d2 also biases the RSV of s2.
We outlined how augmentation can support annotation-based structured document search.
But of course not only structured documents can be the desired objects to retrieve, but also
annotations, possibly as entry points into a discussion thread. If we only want to retrieve
annotations, then in the example above, a3 could be augmented with a4 and a5 and the retrieval
status value of a3(a4,a5) would be calculated. Note that a3 could possibly also be augmented with
content from s2 contained in a merged annotation target of a3.
4.3.1.2 Controlling the Augmentation Behaviour
The augmentation behaviour can be controlled by special commands. The expressions
Figure 4.4: Annotation-based structured document retrieval. Grey boxes reflect subparts of
a structured document, white boxes are annotations. Arrows denote structure,
annotation and reference propagation, respectively.
_structure_propagation()
_annotation_propagation()
_reference_propagation()
say that we augment a context by its logical document structure, by annotations (including
fragments and merged targets), and by referenced objects, respectively (which is the default
behaviour). On the other hand,
_no_structure_propagation()
_no_annotation_propagation()
_no_reference_propagation()
omit the logical structure, annotations and referenced objects, respectively, from (knowledge
and relevance) augmentation. By combining these special commands, we can fine tune the set
of objects involved in augmentation. In the above example, if we say that no structure and
reference propagation should be performed, the augmented context of s2 is s2(a3(a4,a5),a6). If
we omit annotation and reference propagation, s2’s augmented context is s2(s21). If we only
disallow reference propagation, s2(s21(a7),a3(a4,a5),a6) is the augmented context of s2. If we
allow all kinds of propagation, we gain s2(s21(a7),a3(a4,a5),a6(d2)) as the augmented context
of s2.
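The effect of these switches on the augmented context of s2 can be illustrated with a small Python sketch. The three dictionaries are a hypothetical encoding of the part of Figure 4.4 described in the text, and `augmented_context` is an illustrative helper, not part of POLAR itself.

```python
# Hypothetical encoding of the part of Figure 4.4 that concerns s2.
STRUCTURE = {"s2": ["s21"]}                      # logical document structure
ANNOTATIONS = {"s2": ["a3", "a6"],               # direct annotations
               "a3": ["a4", "a5"],
               "s21": ["a7"]}
REFERENCES = {"a6": ["d2"]}                      # referenced objects


def augmented_context(obj, structure=True, annotation=True, reference=True):
    """Return the augmented context of obj in the notation used in the text."""
    subcontexts = []
    if structure:
        subcontexts += STRUCTURE.get(obj, [])
    if annotation:
        subcontexts += ANNOTATIONS.get(obj, [])
    if reference:
        subcontexts += REFERENCES.get(obj, [])
    if not subcontexts:
        return obj
    inner = ",".join(
        augmented_context(sub, structure, annotation, reference)
        for sub in subcontexts
    )
    return f"{obj}({inner})"


print(augmented_context("s2", structure=False, reference=False))   # s2(a3(a4,a5),a6)
print(augmented_context("s2", annotation=False, reference=False))  # s2(s21)
print(augmented_context("s2", reference=False))                    # s2(s21(a7),a3(a4,a5),a6)
print(augmented_context("s2"))                                     # s2(s21(a7),a3(a4,a5),a6(d2))
```

Each keyword argument plays the role of one pair of special commands; setting it to `False` corresponds to the respective `_no_..._propagation()` command.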
4.3.1.3 Example
We will now discuss an example for document and discussion search in POLAR. Consider the
following simple knowledge base consisting of annotations and structured documents:
document(d1)
subpart(s1)
annotation(a1)
annotation(a2)
d1[ 0.7 ir
0.9 s1[ 0.6 db 0.5 *a1 ] ]
a1[ 0.4 t1< 0.6 db >
0.75 is 0.6 *a2 ]
a2[ 0.4 t2< 0.75 is >
0.9 ir ]
0.5 ◦ir 1.0 ◦is 1.0 ◦db
Document d1 is about information retrieval (‘ir’) and has a section (subpart) s1 about databases
(‘db’). s1 is annotated by annotation a1 which is about information systems (‘is’). a1 is
annotated by a2, which is about information retrieval again. a1 and a2 form a toy discussion
thread. The merged target of a1 is determined by the content of its annotated object s1
(analogously for a2). ‘ir’ has an idf of 0.5, ‘is’ and ‘db’ both have an idf of 1. This knowledge
base is the basis for our further considerations on structured document IR and discussion search.
Structured Document IR We want to perform structured document retrieval. This means
our ranking should contain subparts and documents, but no annotations. Let us assume we
search for documents about information systems.
q1[ is ]
relevant1(D) :- //D->q1 & document(D)
relevant1(D) :- //D->q1 & subpart(D)
The first line defines our query; the second and third lines say that every document and subpart
is relevant if its augmented context implies the query. Without knowledge augmentation, d1
and also s1 would not be retrieved, since they do not know about ‘is’. But now, we gain:
?- relevant1(D)
0.375 (s1) # from s1(a1)
0.3375 (d1) # from d1(s1(a1))
The term ‘is’ is propagated from a1 to s1 (and has a weight of 0.5 · 0.75 = 0.375 in s1(a1)). It is
further propagated to d1 with a weight of 0.5 · 0.75 · 0.9 = 0.3375 in d1(s1(a1)). These values
are multiplied with the idf of ‘is’ to get the final result (yielding the same values again due to
multiplication with 1). a1 and a2 are not retrieved, although their augmented contexts (a1(a2)
and a2(t2), respectively) know about ‘is’, because they are not classified as being a subpart or
document.
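The two retrieval status values can be retraced with a short Python sketch. It covers only the simple case of this example, where a term reaches a supercontext along a single chain of access probabilities; the helper name `propagated_weight` is hypothetical, and the general POLAR semantics (several sources for the same term, negative knowledge) is not modelled.

```python
import math

# idf values from the toy knowledge base
IDF = {"ir": 0.5, "is": 1.0, "db": 1.0}


def propagated_weight(term_weight, access_probabilities):
    """Weight of a term after propagation along one chain of access probabilities."""
    return term_weight * math.prod(access_probabilities)


# 'is' has weight 0.75 in a1; a1 is accessed from s1 with 0.5, s1 from d1 with 0.9.
rsv_s1 = propagated_weight(0.75, [0.5]) * IDF["is"]
rsv_d1 = propagated_weight(0.75, [0.5, 0.9]) * IDF["is"]

print(round(rsv_s1, 4))  # 0.375
print(round(rsv_d1, 4))  # 0.3375
```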
Discussion Search Based on our toy knowledge base we can also perform discussion search.
Here, annotations are the focus of retrieval, and we exploit the discussion context of each
annotation. Consider a new query searching for ‘ir’; relevant are only annotations satisfying
this query. We gain:
q2[ ir ]
relevant2(A) :- //A->q2 & annotation(A)
?- relevant2(A)
0.45 (a2) # from a2(t2)
0.27 (a1) # from a1(t1,a2)
‘ir’ has a weight of 0.9 in a2 and 0.6 · 0.9 = 0.54 in a1(t1,a2)10. These values need to be
multiplied with the idf of ‘ir’, which is 0.5. Consider a third query, this time for databases:
10You may notice that a1’s augmented context is not a1(t1,a2(t2)), as we would expect it from the discussion so
far. t2 is the merged annotation target of a2 and contains content from a1, which should not be considered
again. This peculiarity of annotations w.r.t. augmentation will be subject to discussion later.
?- A->q & visible(A)
0.4 (a1)
0.3 (a2)
0.25 (a3)
a1 is retrieved since it is a shared annotation and Thomas is member of the group a1 belongs
to. a2 is retrieved because it is a private annotation and Thomas is its author. a3 is retrieved
because it is a public annotation. All annotations are visible to Thomas, so nothing is filtered.
Now consider the current user is Peter and not Thomas:
current_user(peter)
We then get:
?- A->q & visible(A)
0.4 (a1)
0.25 (a3)
a2 is not fetched since it is a private annotation and Peter is not its author. In case Harold is
the current user, only a3 would be retrieved.
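The filtering behaviour described here can be mimicked in Python. The sketch below is an assumption reconstructed from the explanation alone: the class `Annotation`, the group name `g1` and the function names are hypothetical, and the sketch does not claim to reproduce the POLAR rules that define `visible`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    name: str
    rsv: float          # retrieval status value with respect to the query
    scope: str          # "public", "shared" or "private"
    author: str = ""    # only needed for private annotations in this sketch
    group: str = ""     # only needed for shared annotations in this sketch


# Hypothetical group: Thomas and Peter are members, Harold is not.
GROUPS = {"g1": {"thomas", "peter"}}

ANNOTATIONS = [
    Annotation("a1", 0.4, "shared", group="g1"),
    Annotation("a2", 0.3, "private", author="thomas"),
    Annotation("a3", 0.25, "public"),
]


def visible(annotation, user):
    """Decide whether the given user may see the annotation."""
    if annotation.scope == "public":
        return True
    if annotation.scope == "private":
        return annotation.author == user
    if annotation.scope == "shared":
        return user in GROUPS.get(annotation.group, set())
    return False


def retrieve(user):
    """Ranked list of (RSV, annotation) pairs visible to the user."""
    hits = [(a.rsv, a.name) for a in ANNOTATIONS if visible(a, user)]
    return sorted(hits, reverse=True)


print(retrieve("thomas"))  # [(0.4, 'a1'), (0.3, 'a2'), (0.25, 'a3')]
print(retrieve("peter"))   # [(0.4, 'a1'), (0.25, 'a3')]
print(retrieve("harold"))  # [(0.25, 'a3')]
```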
4.3.5 Semantic Annotations and Ontologies
One of the advantages of logic-based frameworks is the possible integration of additional knowl-
edge into the retrieval function. Such external knowledge can consist of an ontology where
classes and objects are semantically related to each other. A simple example can be an on-
tology regarding generalisation/specialisation (or “IS-A”) relations among cities: each Hessian
city (a city located in the German federal state of Hessia) is a German city, and each German
city is a European city (similar relations can be identified for other European countries and
their cities). Our toy ontology further says that each city located in Illinois is an American
city. We can incorporate this ontology into our knowledge base as follows:
O[ german_city(C) ] :- O[ hessian_city(C) ]
O[ european_city(C) ] :- O[ german_city(C) ]
O[ american_city(C) ] :- O[ illinoisan_city(C) ]
This says that each object that is about a Hessian city is also about a German city and about
a European city. The same holds for objects about cities in Illinois, which are also objects
about American cities. Now consider that we have a document d1 about the city Darmstadt.
Furthermore, consider a categoriser which infers with 0.7 probability that the “Darmstadt”
mentioned in d1 is the Hessian city Darmstadt and with 0.3 probability that the American
city Darmstadt, Illinois, is meant, and puts this knowledge into a semantic annotation a1. The
output of a document indexing and named entity recognition process could be
d1[ 0.8 darmstadt *a1 ]
a1[ 0.7 hessian_city(darmstadt) 0.3 illinoisan_city(darmstadt) ]
To search for all documents about the European city Darmstadt, we apply knowledge augmen-
tation and pose the query
?- //D[darmstadt & european_city(darmstadt)]
The fact that Darmstadt is a Hessian city is known in the augmented context d1(a1) with 0.7
probability. Due to our ontology rules above, this is also the probability that our Darmstadt in
d1 is a European city. The query returns d1 with a corresponding probability of 0.8 · 0.7 = 0.56.
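The inference in this example can be sketched in Python as follows. The dictionary `IS_A` encodes the three ontology rules, and `close_under_ontology` is a hypothetical helper that assumes each class is derived from at most one subclass, as is the case here. The access probability of a1 from d1 is taken to be 1, since none is given in the knowledge base.

```python
# The three ontology rules: subclass -> superclass
IS_A = {
    "hessian_city": "german_city",
    "german_city": "european_city",
    "illinoisan_city": "american_city",
}


def close_under_ontology(facts):
    """Propagate each classification probability up the IS-A chain.

    Assumes that every class is derived from at most one subclass.
    """
    closed = dict(facts)
    for cls, prob in facts.items():
        parent = IS_A.get(cls)
        while parent is not None:
            closed[parent] = prob
            parent = IS_A.get(parent)
    return closed


# Classifications of 'darmstadt' in the semantic annotation a1
a1 = close_under_ontology({"hessian_city": 0.7, "illinoisan_city": 0.3})

# d1[ 0.8 darmstadt *a1 ]: term weight 0.8, a1 accessed with probability 1
p_d1 = 0.8 * 1.0 * a1["european_city"]
print(round(p_d1, 2))  # 0.56
```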
4.3.6 Social Networks
With the advent of the so-called “Web 2.0”, social community platforms (like Flickr, Library-
Thing, Last.fm and YouTube) emerged which let users submit documents (textual documents,
but also images and video). These documents can be shared among users, and, in turn, users
can annotate these documents. Annotations can be textual comments, but also so-called tags.
Collaborative tagging can be regarded as a kind of manual indexing of the document. Another
key feature of social community platforms is the ability to maintain a list of friends. Friends
lists are populated with friends a user knows from the real world, but can also be enriched with
users sharing similar interests. For example, the social music platform Last.fm12 calculates a
similarity score between users, which is based on the musical taste. Another way of calculating
the so-called friendship similarity is reported in Schenkel et al. (2008). By applying normalisa-
tion, such a score can be interpreted as a probability and thus be integrated into the POLAR
framework.
Let us consider a toy knowledge base with four users, Frank, Eva, Paul and Martin. The
friendship score is used to estimate the probability of the ‘friend’ attribute; for Frank, this
might be:
0.8 frank.friend(eva)
0.1 frank.friend(paul)
0.1 frank.friend(martin)
Eva and Frank are close friends, while Paul and Martin are more like strangers to Frank. Consider
two documents, d1 and d2, which are tagged with ‘pop’ by annotations a1 (by Eva), a2 (by
Paul) and a3 (by Martin):
d1[ 0.5 @a1 ]
d2[ 0.5 @a2 0.5 @a3 ]
a1[ pop ]
a2[ pop ]
a3[ pop ]
a1.author(eva)
a2.author(paul)
a3.author(martin)
(we see tags as meta annotations and access them with a probability of 0.5). Frank now wants
all documents which are tagged with ‘pop’, and he prefers tags coming from his friends. In
POLAR, this can be expressed with the rule and query
rel_soc(D) :- D[@A] & A[pop] & A.author(U) & frank.friend(U).
?- rel_soc(D)
For Frank, the tags provided by Eva are more valuable than the ones provided by Paul or
Martin. So although d2 is tagged with ‘pop’ twice, and d1 only once, d1 is ranked ahead of d2:
0.4 d1
0.0975 d2
(0.4 = 0.5 · 0.8 and 0.0975 = 0.05 + 0.05− 0.05 · 0.05). The probability of rel_soc(d1) (resp.
rel_soc(d2)) is the social score of d1 (resp. d2) with respect to the given user (Frank).
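The two scores can be reproduced with a minimal Python sketch, assuming that the different derivations of rel_soc(D) are independent and are therefore combined as a probabilistic disjunction. The dictionaries and the helper `noisy_or` are hypothetical stand-ins for the knowledge base above.

```python
FRIEND = {"eva": 0.8, "paul": 0.1, "martin": 0.1}            # frank.friend(U)
AUTHOR = {"a1": "eva", "a2": "paul", "a3": "martin"}         # A.author(U)
TAGGED_POP = {"a1", "a2", "a3"}                              # A[pop]
ACCESS = {"d1": {"a1": 0.5}, "d2": {"a2": 0.5, "a3": 0.5}}   # D[@A]


def noisy_or(probabilities):
    """Probability that at least one of several independent events is true."""
    none_true = 1.0
    for p in probabilities:
        none_true *= 1.0 - p
    return 1.0 - none_true


def rel_soc(doc):
    """Social score of doc for Frank, combining one derivation per tag."""
    return noisy_or(
        access * FRIEND[AUTHOR[annotation]]
        for annotation, access in ACCESS[doc].items()
        if annotation in TAGGED_POP
    )


print(round(rel_soc("d1"), 4))  # 0.4
print(round(rel_soc("d2"), 4))  # 0.0975
```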
12http://www.last.fm/
4.3.7 Ratings
Ratings are another interesting POLAR application. They often occur in commercial systems
where users can rate, for instance, products or books. For example, within Amazon, users can
rate (among other items) books and CDs by giving 0 to 5 stars (usually 5 stars means “very
good” and 0 stars means “very bad”) on a Likert scale. Such ratings are typical examples
of meta annotations. We assume a 5-tier scale and map a rating onto the probability that
the rated document is good. For this, we use the proposition ‘rated_good’; its probability is
determined by the rating: 0 means a probability of 0, 1 means a probability of 0.2, 2 means
0.4, 3 means 0.6, 4 means 0.8 and a rating of 5 means a probability of 1. (Note that this is a
simple mapping of the scale to probabilities in order to show how such ratings can be applied
in POLAR; actual applications might require a more sophisticated mapping.) Consider the
following knowledge base:
d1[ 0.7 databases 0.5 @a1 0.5 @a2 ]
d2[ 0.8 databases 0.5 @a3 0.5 @a4 ]
a1[ 0.8 rated_good ]
a2[ 0.8 rated_good ]
a3[ 0.4 rated_good ]
a4[ 0.2 rated_good ]
We now search for books about databases which are rated good:
rated_good(D) :- D[@A] & A[rated_good]
?- D[databases] & rated_good(D)
P(rated_good(d1)) = 0.5 · 0.8 + 0.5 · 0.8 − 0.5 · 0.8 · 0.5 · 0.8 = 0.64 and P(rated_good(d2)) =
0.5 · 0.4 + 0.5 · 0.2 − 0.5 · 0.4 · 0.5 · 0.2 = 0.28. The resulting ranking is
0.448 (d1)
0.224 (d2)
(0.448 = 0.7 · 0.64 and 0.224 = 0.8 · 0.28). d1 is ranked ahead of d2 due to the fact that it was
rated better than d2.
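A minimal Python sketch of the rating scenario follows. It assumes the simple linear mapping from stars to probabilities described above and treats the individual rating events as independent; `stars_to_probability`, `noisy_or` and `score` are hypothetical helper names, and the knowledge base is re-encoded with the star values that correspond to the probabilities given.

```python
def stars_to_probability(stars, max_stars=5):
    """Map a star rating onto the probability of 'rated_good' (linear mapping)."""
    if not 0 <= stars <= max_stars:
        raise ValueError("rating out of range")
    return stars / max_stars


def noisy_or(probabilities):
    """Probability that at least one of several independent events is true."""
    none_true = 1.0
    for p in probabilities:
        none_true *= 1.0 - p
    return 1.0 - none_true


# document -> (weight of 'databases', [(access probability, stars), ...])
KNOWLEDGE_BASE = {
    "d1": (0.7, [(0.5, 4), (0.5, 4)]),
    "d2": (0.8, [(0.5, 2), (0.5, 1)]),
}


def score(doc):
    """Probability of D[databases] & rated_good(D) for the given document."""
    term_weight, ratings = KNOWLEDGE_BASE[doc]
    rated_good = noisy_or(acc * stars_to_probability(s) for acc, s in ratings)
    return term_weight * rated_good


ranking = sorted(KNOWLEDGE_BASE, key=score, reverse=True)
for doc in ranking:
    print(round(score(doc), 4), doc)
```

A non-linear mapping, as hinted at in the note above, would only require replacing `stars_to_probability`.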
4.3.8 Annotation-based Trustworthiness
We previously discussed the effect of negative content annotations on the probability that a
term is true in an augmented context. We interpreted this scenario so that by knowledge
augmentation, we decrease the trust in a1 being a good source for statements about ‘political
reasons’.
Especially in public discussions the question arises whether we can trust an annotation.
Authors of annotations can simply be wrong or just talk nonsense. One measure of the trust-
worthiness is the number of positive or negative replies a comment gets. If there are mostly
negative replies, we should not trust a comment; if there are mostly positive ones, then the
comment is a trustworthy source for new information. Consider the following knowledge base:
0.7 ◦football
a1[ 0.7 football 0.5 -*a3 0.5 -@a4 ]
a2[ 0.5 football 0.5 *a5 0.5 *a6 ]
a3[] a4[] a5[] a6[]
4.3.9 Access Probability
Because POLAR does not regard documents and annotations as atomic objects, but also considers
their context as determined by the document structure, annotations and referenced objects,
the access probability plays a central role. When performing augmentation, the
access probability can be compared to a propagation factor which controls to which degree
terms, classifications, attributes, RSVs and their corresponding weights are propagated from
subcontexts to (augmented) supercontexts. For structure queries, the access probability is used
to provide a ranking of matching objects. Access probabilities have also been exploited for the
determination of the trustworthiness of annotations.
The estimation of access probabilities is subject to the actual application on the one hand
and, when used as a propagation factor for augmentation, subject to experiences made in
experiments on the other hand. There are two basic views on the determination of access
probabilities:
• In the user-centric view, access probabilities may be influenced based on user (i.e. reader)
preferences. For example, a user may not want to consider annotations made by certain
authors, or she gives a certain author more priority and therefore raises the access prob-
ability to annotations by this author.
• In the system-oriented view, the access probability is not determined by a user, but
is based on statistics or a certain underlying model. For instance, the random surfer
model (Page et al., 1998) assumes that Web links are randomly accessed with the same
probability, which is determined by the number of links. To adopt a similar behaviour to
POLAR and annotations, we may calculate the probability that a accesses its successor
a′ as
$$P(\mathrm{acc}(a, a')) = \frac{1}{\#\mathrm{annotations}}$$
where #annotations is the number of annotations of a. A further estimation of access
probabilities is derived from experiments where we try to determine for which global
access probability a retrieval function yields the best results.
Both views might be mixed; the system might perform an initial estimation of the access
probabilities, and then the user biases this value based on her preferences.
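Such a mixed estimation might look as follows in Python. This is only a sketch: the uniform estimate follows the formula above, whereas the multiplicative `user_factor` and the clamping to [0, 1] are assumptions made for illustration, not part of POLAR.

```python
def uniform_access_probability(num_annotations: int) -> float:
    """System-oriented estimate: every annotation of a is accessed equally likely."""
    if num_annotations <= 0:
        raise ValueError("the object must have at least one annotation")
    return 1.0 / num_annotations


def biased_access_probability(base: float, user_factor: float) -> float:
    """User-centric bias (hypothetical): scale the estimate and clamp it to [0, 1]."""
    return max(0.0, min(1.0, base * user_factor))


base = uniform_access_probability(4)                  # a has four annotations
preferred = biased_access_probability(base, 2.0)      # user favours this author
ignored = biased_access_probability(base, 0.0)        # user ignores this author
print(base, preferred, ignored)                       # 0.25 0.5 0.0
```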
4.4 Related Work
By providing means to calculate P (d → q), POLAR follows the notion of retrieval as proba-
bilistic inference proposed by van Rijsbergen (1986). As shown in Fig. 4.5, POLAR combines
concepts and methods from areas such as hypertext and structured document (XML) retrieval
(including modelling and querying of complex objects) and discussion search. Naturally, PO-
LAR is related to other work from annotation-based IR.
4.4.1 Hypertext and Structured Document IR and Discussion Search
4.4.1.1 Hypertext IR
Graph-based Approaches Lucarella and Zanzi (1996) propose a graph-based object model
for hypermedia documents. This model deals with objects and classes, attributes, and proper-
ties. Schema graphs can be defined on the object level; instance graphs are based on schema
Figure 4.5: POLAR and related work
graphs on the object level. Retrieval is supported by means of so-called perspectives, which
are subgraphs of the schema and instance graph; certain operations are offered to let the user
specify conditions which objects of a select class have to meet. Other operations allow for
object access and the combination of perspectives. Another interesting approach is reported by
Chiaramella and Kheirbek (1996) who propose an integrated model for hypermedia and infor-
mation retrieval based on conceptual graphs. Content knowledge contains concept types (the
domain knowledge including generalisation/specialisation relations among concepts) for index-
ing documents, and structural knowledge contains the logical document structure of objects as
well as possible relations among them. Complex queries can be posed to the underlying graph
structure, and P (d → q) is calculated during query evaluation in order to create a ranking of
documents. Both graph-based approaches are interesting for structured annotation hypertexts
and annotation-based retrieval, since very complex structure and content queries are supported.
Both approaches differ from the one presented here; for example, augmentation, as we
know it from POLAR, is not supported.
Propagation-based Approaches Another type of hypertext retrieval approach takes neighbouring
nodes (i.e., documents) into account when calculating a final RSV for a given node. The
following approaches have in common that first an RSV for each node is calculated (applying,
for example, a retrieval function based on tf × idf ), which, similar to relevance augmentation,
is then propagated to the node whose final weight has to be determined. Frisse (1988) adds the
normalised final weight of direct neighbours (which again consume the weight of their direct
neighbours) to create a node’s final weight. Frei and Stieger (1994) refine Frisse’s approach
by introducing the concept of spreading activation. Constrained spreading activation is based
on a sophisticated link description which takes the link type, the content of the destination
node and its neighbouring nodes, and link annotations into account. The similarity of a link
description to the query is used in a decision phase to decide whether a link (and probably
its subsequent ones) should be followed or not. If a link is followed, the normalised RSV of
the destination node is added to the RSV of the original node using a propagation factor. In
contrast to that, weighted spreading activation does not know a decision phase (all links are
followed), but uses the similarity of the link to the query as an additional factor to be added to
the final RSV. Both variants of spreading activation can utilise relevance feedback to improve
link descriptions. Experiments show that both spreading activation approaches outperform
retrieval methods not considering any neighbouring nodes. Spreading activation motivated the
relevance augmentation approach in POLAR.
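The common idea behind these propagation-based approaches can be sketched in Python as a single propagation step. The sketch is written in the spirit of the methods described above and is not a reimplementation of any of the cited systems: the normalisation by the maximum RSV, the function name `propagate_rsv` and the default propagation factor are assumptions chosen for illustration.

```python
def propagate_rsv(initial, links, factor=0.5):
    """One propagation step: add a share of the neighbours' normalised RSVs.

    initial: node -> initial RSV (e.g. from a tf x idf retrieval function)
    links:   node -> list of directly linked neighbour nodes
    factor:  propagation factor (hypothetical default)
    """
    top = max(initial.values(), default=0.0) or 1.0   # avoid division by zero
    normalised = {node: rsv / top for node, rsv in initial.items()}
    return {
        node: normalised[node]
        + factor * sum(normalised[neighbour] for neighbour in links.get(node, ()))
        for node in normalised
    }


initial = {"n1": 0.0, "n2": 0.8, "n3": 0.4}
links = {"n1": ["n2", "n3"]}           # n1 links to n2 and n3
final = propagate_rsv(initial, links)
print(final)  # {'n1': 0.75, 'n2': 1.0, 'n3': 0.5}
```

Node n1 does not match the query itself, but obtains a non-zero final weight from its neighbours.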
Inference-based Approaches Certain other retrieval models include the structural context
(logical structure, hypertext links or thread structure) in different ways to produce a ranking.
For example, Croft and Turtle (1989) propose a retrieval model based on Bayesian inference
networks for hypertext retrieval. Nodes represent documents, concepts and the query, as in
the standard Bayesian inference model reported in Turtle and Croft (1990). In order to deal
with various links in hypertexts as well as thesaurus relations, dependencies between nodes
(representing hyperlinks) and concepts (representing thesaurus links) are introduced to the
inference network. A deductive database approach based on probabilistic logics and possible
world semantics is probabilistic Datalog (Fuhr, 2000). Hypertext links can be considered in
form of rules, whereas probabilistic facts represent weighted index terms.
Web IR Beitzel et al. (2003) describe an approach which makes use of a rich representation of
Web documents, which includes, among others, the full text, titles, headers and anchor texts of
referring pages. The evidence coming from these components is fused to calculate a combined
RSV. The approach is similar to the idea of augmentation applied here. Especially anchor texts
are interesting, since they are similar to annotations due to the fact that they directly refer
to the page a link points to. Following the ideas of Page et al. (1998) and Kleinberg (1998),
where the link structure is mined to extract evidence if a page is a good authority, POLAR also
basically supports the exploitation of such non-topical evidence found in the link structure.
Further hypertext approaches and a thorough discussion on hypertext IR can be found in
Agosti and Smeaton (1996).
4.4.1.2 Structured Document Retrieval
The hypertexts which are the underlying data structure of the approaches mentioned above
do not necessarily reflect the logical structure of documents. The above approaches can thus
be applied to a wide range of hypertexts containing structural links as well as inter-document
links (e.g., bibliographic references). In the 90s, and especially with the emergence of XML as
a language to represent structured documents as well as data items, more methods focussed on
structured document retrieval.
Structured documents are hypertexts as well, but with certain peculiarities. The report by
Chiaramella et al. (1996) presents a model for hypermedia documents (sometimes also referred
to as complex objects) on which most approaches for structured document and XML retrieval are
based. One important aspect of structured documents is that document components or subparts
are nodes in a hypertext connected by structural links reflecting the logical document struc-
ture. This way, such links establish an aggregation relation. Consider, for example, this thesis
which consists of parts, chapters, sections, subsections, subsubsections and finally paragraphs,
code snippets, figures and tables. Chapters aggregate sections and paragraphs, parts aggregate
chapters, and the whole thesis is an aggregation of chapters and parts. So in the hypermedia
model, the thesis as well as each part, each chapter etc. would be nodes in a hypertext which
are connected by structural links. Besides structural links, the model by Chiaramella et al. also
allows for other (intra- and inter-document) links; examples are references to other document
parts or the bibliographic references pointing to external documents. The document model in
POLAR, where we define documents as contexts, basically follows and extends this hypermedia
model. Both documents and annotations are regarded as structured documents whose aggregation
is modelled by the subpart relation. Knowledge augmentation aggregates such subcontexts
into supercontexts. POLAR goes beyond structured document retrieval by also incorporating
annotations into the aggregation.
As outlined before, POLAR is strongly related to POOL (Rölleke and Fuhr, 1996; Fuhr
and Rölleke, 1998; Rölleke, 1998; Fuhr et al., 1998). POOL is targeted at modelling complex
hypermedia objects and providing means for structured document retrieval. POLAR integrates
all POOL concepts like four-valued logics, probabilistic logics, object-oriented modelling with
terms, classifications and attributes, as well as aggregation of objects by means of access to
subcontexts allowing for knowledge augmentation.
Another knowledge modelling approach, also targeted at hypermedia and structured doc-
ument retrieval, is MIRTL (Meghini et al., 1993). MIRTL is based on terminological logic.
It is good at representing complex objects and complex queries, but lacks direct support of
important IR features like term weights and the calculation of a retrieval status value.
In recent years, many approaches for structured document retrieval were developed which
utilise the XML representation of the document base. The logical structure of documents
is given by the document type definition (DTD) or an XML schema definition; XML docu-
ments basically adhere to the hypermedia model above. XML retrieval emerged as a branch
of structured document retrieval. Many standard retrieval approaches were adapted in order
to consider the structural context of documents. For instance, Piwowarski et al. (2003) utilise
Bayesian inference networks for structured document retrieval to find a best entry point within
the document structure. They apply a flat retrieval function which calculates an initial RSV for
each document node and then use Bayesian inference to determine a best entry point. Other
approaches use language models for structured document retrieval (see, e.g., Ogilvie and Callan
(2004)). Abolhassani and Fuhr (2004) extend the divergence from randomness approach by
incorporating a so-called third normalisation which introduces the level of a node in a structured
document (with a level of 1 for the root node) as an additional parameter to the Inf2 function.
This makes the risk of accepting a term higher for lower levels, penalising nodes which are not
specific. Fuhr and Großjohann (2004) present XIRQL, an XML query language which addresses
the problem that common XML query languages do not offer any support for IR-oriented XML
querying. Within XIRQL, it is possible to pose content-and-structure queries to XML docu-
ments. Specificity-oriented search, aiming at returning only the most specific document nodes,
is supported by an augmentation strategy very similar to POLAR’s knowledge augmentation.
More approaches for structured document (especially XML) retrieval are reported in the
proceedings of INEX, the Initiative for the Evaluation of XML Retrieval (Fuhr et al., 2003,
2004, 2005, 2006, 2007). These proceedings also contain a thorough discussion about the task
itself and suitable evaluation measures.
4.4.1.3 Discussion Search
Discussion threads are made of newsgroup articles, emails or, as we have shown before, nested
annotations. In Section 4.1.4.1 we therefore identified discussion search as an important task
within annotation-based IR. The goal of discussion search is to find documents (newsgroup
articles, emails or annotations) whose own content satisfies the query. Focussing on a doc-
ument’s “own content” means that parts belonging to previous messages (in form of quoted
text in emails or newsgroup articles) are not considered as part of the document. Because POLAR
also aims at supporting annotation-based discussion search, it is worth looking at related
discussion search approaches.
Xi et al. (2004) propose a feature-based discussion search approach based on linear regression
and support vector machines, respectively. A message’s content features (9 in total) comprise
the content of the message itself, the title’s content, the content of the root of the thread,
the ancestor, children, etc. For all these features, a ranking score is calculated according to
3 different ranking functions, so a document description contains 27 ranking scores (one score
for each combination of content features and ranking function). Some further features like the
number of descendants are added to the document description, together with certain author-dependent
features. Support vector machines and linear regression are used to train a function
which returns the final RSV for a message. While feature-based methods usually perform
well in IR, and the present one is no exception, their disadvantage is that they need a training
sample (documents with training queries and corresponding relevance judgements) to learn the
desired function.
Another approach not relying on a training sample utilises language models and thread-
based query expansion for discussion search (Balog et al., 2006). Discussion search was a
major task in the Enterprise Track of the Text Retrieval Conference (TREC) in 2005 and 2006.
The corresponding TREC proceedings (Voorhees and Buckland, 2005, 2006) introduce further
discussion search methods.
4.4.2 Annotation-based IR
While the approaches explained before can potentially be applied for annotation-based docu-
ment and discussion search, none of them addresses annotations directly. In fact, surprisingly
few annotation-based retrieval methods came up during the last decade.
Fraenkel and Klein (1999) identify annotations as an important source for retrieving relevant
texts, as they show with examples coming from the Bible and the Talmud. Their focus is on
the question of how annotations are properly embedded in the main text in order to allow for
proximity search or, in other words: “How should we measure the distance between a word in
an embedded annotation and a word that occurs later in the main text?”. They discuss three
possible alternatives and corresponding implementation issues.
Golovchinsky et al. (1999) derive and expand full-text queries with terms contained in annotated
passages (the fragments in POLAR), based on the assumption that these passages reflect
users’ interests more accurately. They compared the annotation-generated queries (considering
annotated fragments only) with those expanded by relevance feedback (considering whole rel-
evant documents). Experiments show that annotation-generated queries perform better than
those derived by relevance feedback. This result confirms the assumption that annotated fragments
reflect users’ interests quite well, and it largely motivated the introduction of
fragments into the POLAR framework.
Agosti and Ferro (2005) describe an approach for annotation-based document retrieval using
fusion techniques. The underlying data structure is similar to the structured annotation
hypertext presented in Chapter 3; in fact, the definition of a structured annotation hypertext
was inspired by Agosti and Ferro’s definition of a document-annotation hypertext. In their
approach, a compound similarity score for annotations with respect to the given query is cal-
culated first, which recursively combines the score of an annotation itself and the score of its
successor nodes in the annotation thread. This score is used to calculate a document score
based on annotations alone. The annotation service knows all about annotations, their content
and metadata, and to which document they belong to, but nothing about the document content
itself. To get the content-based score of a document, the query is passed to the corresponding
digital library management system, which returns a document score using its own retrieval
routines. The document’s annotation-based score is then combined with its content-based
score by applying data fusion techniques. The data fusion approach is intended to be
integrated into the Flexible Annotation Service Tool (FAST), which offers an API for annotation-based
document retrieval as well as basic database- and content-oriented annotation querying
functions (Agosti and Ferro, 2006). Once integrated into FAST, the data fusion approach
can handle the realistic situation in which annotation services and document management services
are distributed and annotations are stored independently of the document repository. However,
the approach lacks means to deal with some peculiarities of annotations, such as the distinction
between positive and negative annotations or between meta and content level ones. As an outlook,
POLAR might be a candidate for implementing some of the database- and content-oriented
annotation querying methods described by Agosti and Ferro (2006).
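The two-stage scoring described for this approach (a recursive thread score, then fusion with the content-based score) can be sketched in Python as follows. This is not Agosti and Ferro's actual formulation: the damping factor `alpha`, the use of the best thread score as the document's annotation-based score, the linear fusion weight `lam`, and all class and function names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    """A node in an annotation thread (hypothetical structure)."""
    own_score: float  # similarity of this annotation to the query
    replies: List["Annotation"] = field(default_factory=list)


def thread_score(anno: Annotation, alpha: float = 0.5) -> float:
    """Recursively combine an annotation's own score with its successors' scores.

    Assumption: successors contribute their mean compound score, damped by alpha.
    """
    if not anno.replies:
        return anno.own_score
    successors = sum(thread_score(r, alpha) for r in anno.replies) / len(anno.replies)
    return (1 - alpha) * anno.own_score + alpha * successors


def annotation_based_score(annotations: List[Annotation], alpha: float = 0.5) -> float:
    """Document score from annotations alone; here the best thread score (assumption)."""
    return max((thread_score(a, alpha) for a in annotations), default=0.0)


def fused_score(content_score: float, anno_score: float, lam: float = 0.7) -> float:
    """Linear fusion of content-based and annotation-based score (assumption)."""
    return lam * content_score + (1 - lam) * anno_score
```

In this sketch `content_score` stands for the value returned by the document management system, so the annotation side never needs access to the document content, which mirrors the distributed setting described above.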
Cabanac et al. (2007) present an interesting annotation approach which focusses on the social
validation of annotations. The underlying assumption is that an annotation makes sense when a
social group judges so. Such judgements are more or less implicitly contained within annotation-based
discussion threads. An annotation is socially validated when its arguments are confirmed
or refuted. The social validity measures the degree of confirmation or refutation. This value
is calculated recursively from the replies to an annotation and their social validity. It utilises
an in-depth analysis of objective and subjective data about annotations, deriving measures
such as the annotator’s expertise, whether she provided references and how many, the
comment type (modification, question or example) and the semantics of the annotator’s opinion
(confirmation or refutation). Cabanac et al. discuss three different methods to calculate social
validity. Social validity can support annotation-based retrieval in different ways; for example,
the content of a positive and socially validated annotation qualifies the annotated objects
as being trustworthy, whereas negative annotations attack the importance of the annotated
objects. POLAR supports these mechanisms in principle: in augmentation, negative annotations
decrease the term weights or retrieval status values, respectively, of the annotated objects, while
positive annotations increase them. We further considered positive (confirming) and negative
(refuting) annotations when talking about the trustworthiness of annotations in Section 4.3.8.
Nevertheless, the model proposed by Cabanac et al. provides a more sophisticated view of
social validity, which can potentially be covered by modelling appropriate rules and facts in
POLAR.
4.5 Summary and Discussion
This chapter presented annotation-based knowledge modelling, querying and retrieval with
the POLAR framework. First an overview of information retrieval in general was given, and
then the POLAR framework was presented. POLAR is an extension of POOL; while the
latter focusses on modelling complex objects and on structured document retrieval, the former
extends this by introducing new concepts known from structured annotation hypertexts. In
POLAR, documents and annotations are complex objects which establish so-called contexts.
Propositions (terms, attributes and classifications) can be made within such contexts. For
example, the occurrence of a term in a document is a proposition made in the document’s
context. POLAR copes with different subcontext types, established by subparts, annotations,
fragments, merged targets and references. Within the global database context, assertions can
be made about objects (e.g., metadata), and global classes and IS-A relations can be modelled.
POLAR offers certain expressions to query the underlying structured annotation hypertext. It
further supports the calculation of the implication probability P(d → q). POLAR can deal with
four truth values, which is important, for example, to reflect contradictions in discussions. A
further important annotation attribute supported by POLAR is polarity. One of POLAR’s
core concepts is knowledge augmentation, where a context is extended with the knowledge in its
subcontexts. For instance, the context of a document can be augmented with its annotations.
Knowledge augmentation combined with calculating P(d → q) allows for annotation-based
structured document retrieval on the one hand and is a tool for discussion search on the other.
Similar to knowledge augmentation, relevance augmentation propagates retrieval status values
from subcontexts to supercontexts. This strategy is more suitable if there is no access to the
document knowledge in distributed sources, but only to retrieval status values w.r.t. the given
query. Application examples show the flexibility and expressiveness of POLAR, its powerful
querying and retrieval facilities and the ability to exploit even non-topical information contained
in structured annotation hypertexts. While POLAR is aimed at annotation-based retrieval, it
combines and extends concepts known from hypertexts and structured document retrieval as
well as discussion search.
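Relevance augmentation, as summarised above, propagates retrieval status values (RSVs) from subcontexts to supercontexts. The following Python sketch illustrates this bottom-up propagation. It does not reproduce POLAR's actual augmentation formulas: the `Context` class, the access probabilities, and the combination of RSVs like independent events are assumptions made for illustration, and polarity is ignored.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Context:
    """A context with its own RSV in [0, 1] for a fixed query (hypothetical)."""
    rsv: float
    # (access probability, subcontext) pairs; both are hypothetical inputs
    subcontexts: List[Tuple[float, "Context"]] = field(default_factory=list)


def augmented_rsv(ctx: Context) -> float:
    """Propagate RSVs bottom-up from subcontexts to the supercontext.

    Assumption: each subcontext contributes its augmented RSV weighted by the
    access probability, and contributions combine like independent events.
    """
    not_relevant = 1.0 - ctx.rsv
    for access_prob, sub in ctx.subcontexts:
        not_relevant *= 1.0 - access_prob * augmented_rsv(sub)
    return 1.0 - not_relevant
```

Only RSVs enter the computation, never term weights or facts, which is why this strategy fits the situation in which the document knowledge in distributed sources is not accessible.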
With POLAR, a framework is created which supports various information needs arising when
dealing with annotations. The goal was not only to define yet another solution to retrieve documents
with the help of annotations (although this is the main focus of POLAR), but also to provide
advanced knowledge modelling and querying mechanisms. A logic-based solution was chosen
since it allows for easy integration of additional knowledge into the retrieval process, as well
as modelling and querying the rich representation established by annotations and documents.
It further allows for the creation of flexible retrieval functions which are able to fully exploit
the information contained in structured annotation hypertexts, which would otherwise be ignored
by “thinner” retrieval approaches. We cannot simply reuse POOL for the task at hand
since, by nature, POOL can only deal with trees as we find them in the logical
document structure. Modelling structured annotation hypertexts in POOL is not impossible,
but would be a tedious task. Only subsets of structured annotation hypertexts which have
a tree-like structure (e.g., documents with connected annotation threads, without references
and allowing only one annotation target) are supported by POOL directly. A document and a
connected annotation thread would then be regarded as a complex object in POOL. But even
then, other crucial concepts like negative annotations or the distinction between content
and meta level annotations could only be supported indirectly with POOL (if at all).
Knowledge augmentation is analogous to a reader who reads the content of an
annotation or document and would like to know more. If it is not clear what the annotation is about,
the reader would refer to the annotated parts to grasp the context the annotation was made in.
Knowledge augmentation does the same by accessing merged targets. The reader would access
content annotations to get different opinions about the topics or additional information or
interpretations; this is analogous to knowledge augmentation accessing annotations. The
reader would probably also consider highlighted parts as more important than non-highlighted
ones, so these fragments would attract her attention in a natural way. POLAR tries to reflect
this by accessing fragments during knowledge augmentation. If another object is referenced, a
reader would access it as well with a certain probability to gather additional information. This
is similar to knowledge augmentation following links to referenced objects. Meta annotations
are the exception: they are excluded from knowledge augmentation since including them would mix up
different levels, namely the meta and the content level.
5 POLAR Syntax and Semantics
program         = clause {program};
clause          = fact | query | rule | context | predicate;
context         = obj-id "[" {mergedtarget} contextprogram "]";
mergedtarget    = {weight} obj-id "<" factlist ">";
contextprogram  = contextclause {contextprogram};
contextclause   = fact | subpart | annoref | reference | fragment;
subpart         = {weight} obj-id "[" contextprogram "]";
fragment        = {weight} obj-id "||" factlist {annoref} "||";
factlist        = fact {factlist};
fact            = proposition
                | "!" proposition
                | weightlist proposition;
annoref         = contentannoref | metaannoref;
contentannoref  = {weight} "*" obj-id
                | {weight} "-*" obj-id;
metaannoref     = {weight} "@" obj-id
                | {weight} "-@" obj-id;
reference       = {weight} "=>" obj-id;
proposition     = term
                | classname "(" obj-id ")"
                | obj-id.attr-name "(" constant ")";
predicate       = {weight} term-predicate
                | {weight} class-predicate
                | {weight} attr-predicate;
term-predicate  = "◦" term;
class-predicate = "◦◦" classname;
attr-predicate  = "◦◦◦" attr-name;
constant        = obj-id | number | string;
weightlist      = weight
                | weight "/" weight
                | weight "/" weight "/" weight
                | weight "/" weight "/" weight "/" weight;
Figure 5.1: Basic POLAR syntax. See Fig. 5.2 for rules and queries.
for the probability of a proposition to be true, the second weight for false, the third one for
inconsistent and the last one for unknown.
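The `weightlist` rule of Fig. 5.1 and the order of the four truth values given above can be illustrated with a small Python parser. It is only a sketch: the function name `parse_weightlist`, the restriction of weights to [0, 1], and in particular the treatment of probability mass that is not stated explicitly (assigned to "unknown" when fewer than four weights are given) are assumptions and not taken from the POLAR definition.

```python
from typing import Dict

# Order of weights as described in the text: true, false, inconsistent, unknown.
TRUTH_VALUES = ("true", "false", "inconsistent", "unknown")


def parse_weightlist(text: str) -> Dict[str, float]:
    """Parse 'w', 'w/w', 'w/w/w' or 'w/w/w/w' into a four-valued distribution."""
    weights = [float(part) for part in text.split("/")]
    if not 1 <= len(weights) <= 4:
        raise ValueError("a weightlist has between one and four weights")
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("weights must be probabilities in [0, 1]")
    dist = dict.fromkeys(TRUTH_VALUES, 0.0)
    dist.update(zip(TRUTH_VALUES, weights))
    if len(weights) < 4:
        # Assumption: mass not assigned explicitly is treated as 'unknown'.
        dist["unknown"] = 1.0 - sum(weights)
    if dist["unknown"] < -1e-9 or abs(sum(dist.values()) - 1.0) > 1e-9:
        raise ValueError("weights must form a probability distribution")
    return dist
```

For example, `parse_weightlist("0.5/0.2/0.1/0.2")` assigns 0.5 to true, 0.2 to false, 0.1 to inconsistent and 0.2 to unknown.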
Each POLAR object is modelled as a context. Every context has an object ID
identifying the object or clause, respectively. Each context may contain a merged target. A
merged target is a special context containing facts appearing in each target an annotation has.
Like any other context, a merged target is uniquely identified by an object ID.
There is a specific context program for each context, which is different from general programs.
A context program contains context clauses, which can be facts, subparts, references
to annotations (so-called annorefs), references to other objects, or fragments. Subparts are the
subcomponents of structured objects; they do not have any merged target.
A fragment is another special context which is identified by an object ID. Since fragments
are areas created during the process of annotation, they are either linked to an annotation in
case they are the target of this annotation, or they can be the object a reference points to.
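The structure of contexts described above can be mirrored in a few Python data classes. This is a sketch for illustration only and not part of POLAR: all class and field names, the representation of facts as plain strings, and the default weight of 1.0 are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AnnoRef:
    """Reference to an annotation (an 'annoref')."""
    anno_id: str
    meta: bool = False      # False: content level ("*"); True: meta level ("@")
    negative: bool = False  # True for the "-*" and "-@" forms
    weight: float = 1.0     # default weight is an assumption


@dataclass
class MergedTarget:
    obj_id: str
    facts: List[str] = field(default_factory=list)
    weight: float = 1.0


@dataclass
class Fragment:
    obj_id: str
    facts: List[str] = field(default_factory=list)
    annorefs: List[AnnoRef] = field(default_factory=list)
    weight: float = 1.0


@dataclass
class Context:
    obj_id: str
    merged_target: Optional[MergedTarget] = None  # subparts leave this as None
    facts: List[str] = field(default_factory=list)
    subparts: List["Context"] = field(default_factory=list)
    annorefs: List[AnnoRef] = field(default_factory=list)
    references: List[str] = field(default_factory=list)  # IDs of referenced objects
    fragments: List[Fragment] = field(default_factory=list)
    weight: float = 1.0


def content_annotation_ids(ctx: Context) -> List[str]:
    """Collect IDs of content-level annotations on a context, its fragments and subparts."""
    ids = [ref.anno_id for ref in ctx.annorefs if not ref.meta]
    for frag in ctx.fragments:
        ids.extend(ref.anno_id for ref in frag.annorefs if not ref.meta)
    for sub in ctx.subparts:
        ids.extend(content_annotation_ids(sub))
    return ids
```

The helper `content_annotation_ids` shows one use of such a structure: it walks a document, its fragments and its subparts and skips meta level annorefs.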
