Measuring Conceptual Similarity by Spreading Activation over Wikipedia’s Hyperlink Structure

by Stephan Gouws, G-J van Rooyen, Herman A. Engelbrecht
Current (2010)

Available from www.aclweb.org
Proceedings of the 2nd Workshop on “Collaboratively Constructed Semantic Resources”, Coling 2010, pages 46–54,
Beijing, August 2010
Stephan Gouws, G-J van Rooyen, and Herman A. Engelbrecht
Stellenbosch University
{stephan,gvrooyen,hebrecht}@ml.sun.ac.za
Abstract
Keyword-matching systems based on
simple models of semantic relatedness
are inadequate at modelling the ambigu-
ities in natural language text, and cannot
reliably address the increasingly com-
plex information needs of users. In
this paper we propose novel methods
for computing semantic relatedness by
spreading activation energy over the hy-
perlink structure of Wikipedia. We
demonstrate that our techniques can
approach state-of-the-art performance,
while requiring only a fraction of the
background data.
1 Introduction
The volume of information available to users
on the World Wide Web is growing at an
exponential rate (Lyman and Varian, 2003).
Current keyword-matching information retrieval
(IR) systems suffer from several limitations,
most notably an inability to accurately model
the ambiguities in natural language, such as syn-
onymy (different words having the same mean-
ing) and polysemy (one word having multiple
different meanings), which is largely governed
by the context in which a word appears (Metzler
and Croft, 2006).
In recent years, much research attention has
therefore been given to semantic techniques of
information retrieval. Such systems allow for
sophisticated semantic search but require
the use of a more difficult-to-understand query
syntax (Tran et al., 2008). Furthermore, these
methods require specially encoded (and thus
costly) ontologies to describe the particular do-
main knowledge in which the system operates,
and the specific interrelations of concepts within
that domain.
In this paper, we focus on the problem of
computationally estimating similarity or related-
ness between two natural-language documents.
A novel technique is proposed for comput-
ing semantic similarity by spreading activation
over the hyperlink structure of Wikipedia, the
largest free online encyclopaedia. New mea-
sures for computing similarity between individ-
ual concepts (inter-concept similarity, such as
“France” and “Great Britain”), as well as be-
tween documents (inter-document similarity)
are proposed and tested. It will be demonstrated
that the proposed techniques achieve inter-concept
and inter-document similarity accuracy that is
comparable, on similar datasets, to that of the
current state-of-the-art Wikipedia Link-based
Measure (WLM) (Witten and Milne, 2008) and
Explicit Semantic Analysis (ESA) (Gabrilovich
and Markovitch, 2007) methods, respectively.
Our methods outperform WLM in computing
inter-concept similarity, and match ESA for
inter-document similarity. Furthermore, we use
the same background data as for WLM, which is
less than 10% of the data required for ESA.
In the following sections we review related
work and give an overview of our
approach and the problems that have to be
solved. We then discuss our method in detail and
present several experiments to test and compare
it against other state-of-the-art methods.
2 Related Work and Overview
Although Spreading Activation (SA) is foremost
a cognitive theory modelling semantic memory
(Collins and Loftus, 1975), it has been applied
computationally to IR with various levels
of success (Preece, 1982). The biggest
hurdle in this regard is the cost of creating an
associative network or knowledge base with
adequate conceptual coverage (Crestani, 1997).
Recent knowledge-based methods for comput-
ing semantic similarity between texts based on
Wikipedia, such as Wikipedia Link-based Mea-
sure (WLM) (Witten and Milne, 2008) and Ex-
plicit Semantic Analysis (ESA) (Gabrilovich and
Markovitch, 2007), have been found to out-
perform earlier WordNet-based methods (Bu-
danitsky and Hirst, 2001), arguably due to
Wikipedia’s larger conceptual coverage.
WLM treats the anchor text in Wikipedia articles
as links to other articles (all links are treated
equally), and compares concepts based on how
much overlap exists in the out-links of the
articles representing them. ESA discards the link
structure and uses only the text in articles to
derive an explicit concept space in which each
dimension represents one article/concept. Texts are
represented as vectors in this concept space, and
similarity is computed as the cosine similarity of
their ESA vectors. The most similar work to ours
is Yeh (2009) in which the authors derive a graph
structure from the inter-article links in Wikipedia
pages, and then perform random walks over the
graph to compute relatedness.
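To make the vector-space comparison described above concrete, the following is a minimal sketch of cosine similarity between two sparse concept vectors. The dictionary representation, the function name cosine_similarity and the toy weights are assumptions made for illustration; they are not taken from ESA or from this paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {dimension: weight} dicts."""
    # The dot product only needs the dimensions that both vectors share.
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors: keys are article titles, values are made-up weights.
doc_a = {"France": 0.8, "Paris": 0.6}
doc_b = {"France": 0.6, "Great Britain": 0.8}
print(cosine_similarity(doc_a, doc_b))
```

Storing only the non-zero dimensions keeps the comparison cheap when each text activates a small fraction of all articles.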
In Wikipedia, users create links between arti-
cles which are seen to be related to some degree.
Since links relate one article to its neighbours,
and by extension to their neighbours, we ex-
tract and process this hyperlink structure (using
SA) as an Associative Network (AN) (Berger
et al., 2004) of concepts and links relating them
to one another. The SA algorithm can briefly
be described as an iterative process of propagat-
ing real-valued energy from one or more source
nodes, via weighted links over an associative
network (each such propagation is called a pulse).
The algorithm consists of two steps: First, one
or more pulses are triggered, and second, ter-
mination checks determine whether the process
should continue or halt. This process of
activating more and more nodes in the network and
checking for termination conditions is repeated
pulse after pulse, until all termination conditions
are met, which results in a final activation state
for the network. These final node activations
are then translated into a score of relatedness be-
tween the initial nodes.
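The pulse-and-check loop described above can be sketched as follows. This is only an illustration under stated assumptions: the decay factor, the energy threshold, the pulse budget and the function name spread_activation are hypothetical choices, not the parameters or constraints used in this paper, and the final step of translating activations into a relatedness score is left out.

```python
def spread_activation(links, sources, decay=0.5, threshold=0.01, max_pulses=3):
    """Propagate energy over a weighted directed network.

    links:   {node: {neighbour: weight}} (assumed representation)
    sources: {node: initial energy}
    Returns the accumulated activation of every node that was reached.
    """
    activation = dict(sources)   # total energy accumulated at each node
    frontier = dict(sources)     # energy to be sent out in the next pulse
    for _ in range(max_pulses):  # termination check: pulse budget
        next_frontier = {}
        for node, energy in frontier.items():
            for neighbour, weight in links.get(node, {}).items():
                passed = energy * weight * decay
                if passed < threshold:  # too weak to propagate further
                    continue
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + passed
        if not next_frontier:    # termination check: nothing left to propagate
            break
        for node, energy in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + energy
        frontier = next_frontier
    return activation


# Hypothetical toy network with made-up weights.
toy_links = {"A": {"B": 1.0, "C": 0.5}, "B": {"C": 1.0}, "C": {}}
final_state = spread_activation(toy_links, {"A": 1.0})
print(final_state)
```

In this toy run, node C receives energy along two paths (directly from A and via B), which is the kind of accumulated evidence a relatedness score could be derived from.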
Our work presents a computational imple-
mentation of SA over the Wikipedia graph.
We therefore overcome the cost of produc-
ing a knowledge base of adequate coverage by
utilising the collaboratively-created knowledge
source Wikipedia. However, additional strate-
gies are required for translating the hyperlink
structure of Wikipedia into a suitable associative
network format; new techniques for this purpose are
proposed and tested.
3 Extracting the Hyperlink Graph Structure
One article in Wikipedia covers one specific
topic (concept) in detail. Hyperlinks link a page
A to a page B, and are thus directed. We
can model Wikipedia’s hyperlink structure us-
ing standard graph theory as a directed graph G,
consisting of a set of vertices V, and a set of
edges E. Each edge e_ij ∈ E connects two
vertices v_i, v_j ∈ V. For consistency, we use the
term node to refer to a vertex (Wikipedia article)
in the graph, and link to refer to an edge (hyper-
link) between such nodes.
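As an illustration of this graph model, the sketch below builds out-link and in-link adjacency sets from a list of (source, target) hyperlink pairs. The input format, the function name build_link_graph and the example links are assumptions for the sake of the example; the paper does not specify how the link data is stored.

```python
from collections import defaultdict


def build_link_graph(hyperlinks):
    """Build out-link and in-link adjacency sets from (source, target) pairs."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for source, target in hyperlinks:
        out_links[source].add(target)  # directed link: source -> target
        in_links[target].add(source)
    return out_links, in_links


# Hypothetical links between three articles (nodes).
hyperlinks = [
    ("France", "Paris"),
    ("France", "Great Britain"),
    ("Great Britain", "France"),
]
out_links, in_links = build_link_graph(hyperlinks)
print(sorted(out_links["France"]))
```

Keeping both directions makes it cheap to ask which articles a node links to and which articles link to it, since the links are directed and the two sets generally differ.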
In this model, each Wikipedia article is seen
to represent a single concept, and the hyperlink
structure relates these concepts to one another. In
order to compute relatedness between two
concepts v_i and v_j, we use spreading activation and
rely on the fundamental principle of an associa-
tive network, namely that it connects nodes that
are associated with one another via real-valued
links denoting how strongly the objects are
related. Since Wikipedia was not created as an
associative network, but primarily as an online
encyclopaedia, no such weights exist, and we
have to deduce them (see Fan-out constraint
in Section 4).
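To illustrate what deducing link weights from an unweighted hyperlink graph could look like, the sketch below gives every out-link of a node an equal share of 1. This uniform scheme and the function name uniform_fanout_weights are purely hypothetical; they are not the weighting used in Section 4, which lies outside this excerpt.

```python
def uniform_fanout_weights(out_links):
    """Give each link leaving a node the weight 1 / (number of out-links of that node).

    out_links: {node: set of target nodes} (assumed representation)
    Returns {node: {target: weight}}.
    """
    weights = {}
    for node, targets in out_links.items():
        if not targets:
            weights[node] = {}  # a node without out-links passes on nothing
            continue
        share = 1.0 / len(targets)
        weights[node] = {target: share for target in targets}
    return weights


# Hypothetical unweighted graph.
toy_out_links = {"A": {"B", "C"}, "B": {"A"}, "C": set()}
toy_weights = uniform_fanout_weights(toy_out_links)
print(toy_weights["A"])
```

Under this assumption, a node with many out-links contributes less per link than a node with few, which is one simple way to turn link structure into real-valued association strengths.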