A Self-Training Approach for Resolving Object Coreference on the Semantic Web
- ISBN: 9781450306324
Wei Hu
State Key Laboratory for Novel
Software Technology
Nanjing University
Nanjing, 210093, PR China
whu@nju.edu.cn
Jianfeng Chen
State Key Laboratory for Novel
Software Technology
Nanjing University
Nanjing, 210093, PR China
jf_chen@ymail.com
Yuzhong Qu
State Key Laboratory for Novel
Software Technology
Nanjing University
Nanjing, 210093, PR China
yzqu@nju.edu.cn
ABSTRACT
An object on the Semantic Web is likely to be denoted with
multiple URIs by different parties. Object coreference
resolution is to identify “equivalent” URIs that denote the same
object. Driven by the Linking Open Data (LOD) initiative,
millions of URIs have been explicitly linked with owl:sameAs
statements, but a considerable number of potentially coreferent
ones remain unlinked. Existing approaches address the problem
mainly from two directions: one is based upon equivalence
inference mandated by OWL semantics, which finds semantically
coreferent URIs but probably omits many potential ones; the other
is via similarity computation between property-value pairs,
which is not always accurate enough. In this paper, we propose
a self-training approach for object coreference resolution
on the Semantic Web, which leverages the two classes
of approaches to bridge the gap between semantically coreferent
URIs and potential candidates. For an object URI, we
first establish a kernel that consists of semantically coreferent
URIs based on owl:sameAs, (inverse) functional properties
and (max-)cardinalities, and then extend this kernel
iteratively in terms of discriminative property-value pairs in
the descriptions of URIs. In particular, the discriminability
is learnt with a statistical measurement, which not only
exploits key characteristics for representing an object, but also
takes into account the matchability between properties from
pragmatics. In addition, frequent property combinations are
mined to improve the accuracy of the resolution. We implement
a scalable system and demonstrate that our approach
achieves good precision and recall for resolving object
coreference, on both benchmark and large-scale datasets.
Categories and Subject Descriptors
H.3 [Information Systems]: Information Storage and Retrieval;
D.2.12 [Software Engineering]: Interoperability;
I.2.6 [Artificial Intelligence]: Learning
General Terms
Algorithms, Experimentation, Performance
Keywords
Object coreference, object consolidation, self-training,
property combination, data fusion
Copyright is held by the International World Wide Web Conference
Committee (IW3C2). Distribution of these papers is limited to classroom use,
and personal use by others.
WWW 2011, March 28–April 1, 2011, Hyderabad, India.
ACM 978-1-4503-0632-4/11/03.
1. INTRODUCTION
The Semantic Web is an ongoing effort by the W3C Semantic
Web Activity, with the purpose of realizing data
integration and sharing among different applications and
organizations. To date, a number of prominent ontologies have
emerged for publishing data in specific domains, such as the
Friend of a Friend (FOAF) vocabulary. They define common
identifiers for classes and properties, in the form of URIs,
which have been widely used across data sources.
At the instance level, however, data sources are still far from
agreeing on the use of common URIs
to identify a specific object [11]. In fact, due to the decentralized
and dynamic nature of the Semantic Web, it frequently
happens that many different URIs from a variety of sources,
most likely originating from different RDF documents,
denote one real-world object, i.e., represent the same identity.
Such examples exist in the domains of personal profiles,
academic publications, encyclopedic or geographical resources,
etc.
Object coreference resolution, also known as object consolidation
or identification [1], is the task of identifying multiple
URIs for the same real-world object, i.e., finding coreferent
URIs which represent a unique identity. Object coreference
resolution is important for data-centric applications, such as
fusing distributed descriptions of equivalent RDF resources
in data integration systems.
Driven by the Linking Open Data (LOD) initiative, millions
of URIs from independent data sources have been
explicitly interlinked with owl:sameAs statements [3]. However,
considering the billions of object URIs on the current Semantic
Web, we observe that there still exist a large number of
URIs which implicitly represent the same objects but have
not been connected with owl:sameAs yet. For example, at
least 70 URIs returned by the Falcons search engine [2] denote
a person “Tim Berners-Lee”, the director of W3C, but only
five of them are linked with owl:sameAs.
In the field of the Semantic Web, recent studies address this
problem mainly from two directions: one is based on utilizing
standard OWL semantics, such as owl:sameAs [8] and
inverse functional properties (IFPs) [11]; the other follows
the intuition that two URIs represent the same
real-world object if they share some similar property-value
pairs [7, 13]. Generally speaking, the semantics-based way
can infer explicitly coreferent URIs but probably misses a
lot of potential candidates, while the similarity-based way is
not always accurate due to heterogeneous ways of expressing
the same thing. Hence, a key issue for resolving object
coreference is: how can we combine the two
classes of techniques for building bridges between coreferent
URIs that we already have and potential candidates?
WWW 2011 – Session: Web Mining March 28–April 1, 2011, Hyderabad, India
In this paper, we propose a self-training approach that leverages
both the semantics-based and similarity-based ways to
address the problem of object coreference resolution on the
Semantic Web. Self-training is a well-known class of
semi-supervised learning algorithms, in which a learner repeatedly
labels unlabeled examples and re-trains itself on the
extended labeled training set [28]. Self-training is suitable for
solving our problem, because there are abundant unresolved
object URIs, but the number of existing semantically coreferent
ones is limited.
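As an illustration of the self-training idea only, the loop can be sketched as follows. The function names `self_train` and `shared_pairs`, the exact-match labeling rule and the toy descriptions are assumptions made for this sketch; they are not the learner or the data used in the paper.

```python
from collections import Counter
from typing import Callable, Dict, Set, Tuple

Pair = Tuple[str, str]               # (property, value)
Descriptions = Dict[str, Set[Pair]]  # URI -> its property-value pairs


def shared_pairs(labeled: Set[str], descriptions: Descriptions) -> Set[Pair]:
    """Toy learner (assumption): keep pairs carried by at least two labeled URIs."""
    counts = Counter(p for u in labeled for p in descriptions.get(u, ()))
    return {p for p, n in counts.items() if n >= 2}


def self_train(kernel: Set[str],
               descriptions: Descriptions,
               learn_pairs: Callable[[Set[str], Descriptions], Set[Pair]],
               max_iterations: int = 10) -> Set[str]:
    """Grow a labeled set of coreferent URIs from an initial kernel."""
    labeled = set(kernel)
    for _ in range(max_iterations):
        # Re-train on everything labeled so far.
        pairs = learn_pairs(labeled, descriptions)
        # Label each unlabeled URI that carries one of the learnt pairs.
        new = {u for u, desc in descriptions.items()
               if u not in labeled and desc & pairs}
        if not new:
            break  # nothing was added in this round, so the set is stable
        labeled |= new
    return labeled


# Hypothetical descriptions, for illustration only.
descriptions = {
    "a": {("name", "Tim"), ("mbox", "t@x")},
    "b": {("name", "Tim"), ("nick", "timbl")},
    "c": {("name", "Tim"), ("nick", "timbl")},
    "d": {("nick", "timbl")},
    "e": {("name", "Bob")},
}
result = self_train({"a", "b"}, descriptions, shared_pairs)
```

In this toy run, "c" is labeled in the first round through the shared name, and only then does the nick pair become frequent enough to reach "d" in the second round, which is the re-training effect that self-training relies on.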
Specifically, taking an object URI as input, we first
establish a kernel that consists of a set of semantically coreferent
URIs based on owl:sameAs, (inverse) functional properties
and (max-)cardinalities, and then iteratively extend
the kernel in terms of discriminative property-value pairs in
the descriptions of URIs. The discriminability of a
property-value pair is learnt based on a statistical measurement, which
not only exploits the key characteristics for representing an
object, but also takes into account the matchability between
properties from pragmatics. Furthermore, frequent property
combinations are mined to enhance the selection criteria of
properties during each iteration, so that the accuracy of the
resolution is further improved. We develop a scalable system
and evaluate its performance on a benchmark dataset from
OAEI 2010 and a large-scale dataset that was collected by the
Falcons search engine in 2008. The experimental results
demonstrate that our approach achieves acceptable F-Measure on
both datasets, as compared with the performance of six
representative competitors.
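The kernel-building step can be sketched as a transitive closure over equivalence evidence. The sketch below is a simplification under stated assumptions: it handles only owl:sameAs and inverse functional properties, leaves out functional properties and cardinalities, and treats IFP values as plain strings. The function name `build_kernel` and the ex: URIs are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, value)
SAME_AS = "owl:sameAs"


def build_kernel(uri: str, triples: Iterable[Triple], ifps: Set[str]) -> Set[str]:
    """Return the URIs linked to `uri` through owl:sameAs or a shared IFP value."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x: str, y: str) -> None:
        parent[find(x)] = find(y)

    first_subject: Dict[Tuple[str, str], str] = {}
    for s, p, o in triples:
        if p == SAME_AS:
            union(s, o)
        elif p in ifps:
            # Two subjects sharing the value of an IFP denote the same object.
            union(first_subject.setdefault((p, o), s), s)

    root = find(uri)
    return {x for x in list(parent) if find(x) == root}


# Hypothetical triples, for illustration only.
triples = [
    ("dbpedia:Beijing", SAME_AS, "geo:1816670"),
    ("ex:a", "foaf:mbox", "mailto:x@example.org"),
    ("ex:b", "foaf:mbox", "mailto:x@example.org"),
    ("ex:b", SAME_AS, "ex:c"),
]
beijing_kernel = build_kernel("dbpedia:Beijing", triples, {"foaf:mbox"})
person_kernel = build_kernel("ex:a", triples, {"foaf:mbox"})
```

A union-find structure is used here because equivalence is transitive: ex:a reaches ex:c through ex:b even though no single statement links them.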
The remainder of this paper is structured as follows. The
self-training framework of our proposed approach is first
outlined in Section 2. Section 3 introduces a method to find
semantically coreferent URIs in terms of OWL semantics.
Section 4 describes our self-training algorithm for resolving
object coreference with a statistical measurement to calculate
the discriminability of a property-value pair. Section 5
presents a way to mine frequent property combinations for
improving the accuracy of the resolution. Experimental results
on the benchmark and large-scale datasets are reported
in Section 6. Section 7 discusses related work and finally
Section 8 concludes this paper with future work.
2. OVERVIEW OF THE APPROACH
The architecture of our proposed approach is outlined in
Fig. 1, which starts with an object URI u. After three
processing stages, the approach returns a set of coreferent URIs
that denote the same object as u.
1. Building a kernel. We construct a kernel of semantically
coreferent URIs for u based on the OWL semantics
of owl:sameAs, owl:InverseFunctionalProperty
(owl:IFP for short), owl:FunctionalProperty (abbr.
owl:FP), owl:cardinality and owl:maxCardinality.
The five built-in vocabulary elements in OWL are frequently
used to infer the equivalence relation in many
systems [19], and combining them together establishes
a larger initial labeled training set.
2. Learning discriminative property-value pairs. This
is an iterative process, which first learns discriminative
property-value pairs from some labeled coreferent
URIs, and then uses the pairs to find more coreferent
ones. In accordance with previous works in [10, 13, 17],
we assume that coreferent URIs share several common
property-value pairs, and certain property-value pairs
are more useful for coreference resolution.
For any two URIs, we extract their involved property-value
pairs from the dereference documents,¹ and compare
these values with the string matching algorithm
I-Sub [22]. If the similarity between two values is larger
than a threshold, then the two related properties have
a kind of commonality. For a set of coreferent URIs, we
select the property pair sharing the most matchable values,
and assign the most common value to each property in
this pair. These two property-value pairs reflect some
discriminative characteristics of their denoted object,
and they are used to find more coreferent URIs.
3. Choosing properties based on frequent property
combinations. Some properties are better used together
to describe an object, such as longitude
and latitude for a coordinate. If we choose only
one of them for identifying coreferent URIs, the results
tend to be inaccurate. Therefore, we apply association
rule mining with heuristic refinement to discover frequent
property combinations. In each learning
iteration, if any property in a frequent property combination
is chosen, the remaining property in the combination
is added together with its most common value (if one
exists in the training set). Consequently, these two
properties with their associated values are used together for
searching new coreferent URIs.
Figure 1: Overview of the proposed approach
[Figure 1 labels: A URI; Building a kernel; Initializing training set;
Self-training; Learning discriminative property-value pairs;
Frequent property combinations; External knowledge; Labeled URIs;
Unlabeled URIs; Coreferent URIs]
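The pair-selection rule of stage 2 can be sketched as follows, under several assumptions: `difflib.SequenceMatcher` stands in for I-Sub (which is not part of the Python standard library), the threshold 0.8 is arbitrary, a plain match count replaces the statistical discriminability measure introduced later in the paper, and the toy descriptions (including the ex:note property) are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (property, value)


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Stand-in for I-Sub (assumption): generic string similarity from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold


def learn_discriminative_pairs(labeled: Dict[str, List[Pair]],
                               threshold: float = 0.8) -> List[Pair]:
    """Pick the property pair with the most matchable values among labeled
    URIs and give each property of that pair its most common value."""
    matches: Counter = Counter()                      # property pair -> match count
    values: Dict[Tuple[str, str], Dict[str, Counter]] = {}
    for (_, d1), (_, d2) in combinations(labeled.items(), 2):
        for p1, v1 in d1:
            for p2, v2 in d2:
                if similar(v1, v2, threshold):
                    key = tuple(sorted((p1, p2)))
                    matches[key] += 1
                    values.setdefault(key, {}).setdefault(p1, Counter())[v1] += 1
                    values[key].setdefault(p2, Counter())[v2] += 1
    if not matches:
        return []
    best, _ = matches.most_common(1)[0]
    # dict.fromkeys drops the duplicate when both properties are the same.
    return [(p, values[best][p].most_common(1)[0][0]) for p in dict.fromkeys(best)]


# Hypothetical labeled descriptions, for illustration only.
labeled = {
    "dbpedia:Beijing": [("rdfs:label", "Beijing"), ("ex:note", "capital of China")],
    "geo:1816670": [("geo:alternateName", "Beijing"), ("ex:note", "populated place")],
    "semweb:Beijing": [("rdfs:label", "Beijing")],
}
learnt = learn_discriminative_pairs(labeled)
```

On this input the pair (geo:alternateName, rdfs:label) collects two matches against one for (rdfs:label, rdfs:label), so both properties are returned with the value "Beijing". Ties between property pairs are broken arbitrarily in this sketch.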
Example 1. For illustration purposes, let us consider four
RDF documents containing candidate URIs for coreference
resolution in Fig. 2. Assume that dbpedia:Beijing is the
input object URI for resolution. By searching for owl:sameAs
statements, dbpedia:Beijing is found to be semantically
coreferent with geo:1816670. During training, (rdfs:label,
“Beijing”) and (geo:alternateName, “Beijing”) are learnt in the
first iteration as the most discriminative property-value pairs.
As a result, semweb:Beijing is found. In the second round,
(wgs84_pos:lat, “40”) is the most discriminative pair, and a
wrong coreferent result ex:New_York is discovered. However,
once the frequent property combination {wgs84_pos:lat,
wgs84_pos:long} is considered, ex:New_York is no longer
included, because the values of wgs84_pos:long are completely
different (“116” for Beijing, but “74” for New York).
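The second round of Example 1 can be replayed with a short sketch. The descriptions below are an assumed reading of the example (Fig. 2 is not reproduced here), each property holds a single value for brevity, values are compared by exact equality rather than I-Sub, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Set

Descriptions = Dict[str, Dict[str, str]]  # URI -> {property: value}


def most_common_value(prop: str, labeled: Set[str],
                      descriptions: Descriptions) -> Optional[str]:
    """Most common value of `prop` among the labeled URIs, if any."""
    counts = Counter(descriptions[u][prop] for u in labeled
                     if prop in descriptions[u])
    return counts.most_common(1)[0][0] if counts else None


def find_coreferent(prop: str, labeled: Set[str], descriptions: Descriptions,
                    combos: Iterable[Set[str]] = ()) -> Set[str]:
    """Unlabeled URIs matching `prop` plus any property combined with it."""
    props = {prop}
    for combo in combos:
        if prop in combo:
            props |= combo  # complement with the rest of the combination
    required = {}
    for p in props:
        value = most_common_value(p, labeled, descriptions)
        if value is not None:  # only complement values seen in the training set
            required[p] = value
    if prop not in required:
        return set()
    return {u for u, d in descriptions.items()
            if u not in labeled
            and all(d.get(p) == v for p, v in required.items())}


# Assumed toy descriptions following Example 1.
descriptions = {
    "dbpedia:Beijing": {"rdfs:label": "Beijing",
                        "wgs84_pos:lat": "40", "wgs84_pos:long": "116"},
    "geo:1816670": {"geo:alternateName": "Beijing",
                    "wgs84_pos:lat": "40", "wgs84_pos:long": "116"},
    "semweb:Beijing": {"rdfs:label": "Beijing"},
    "ex:New_York": {"rdfs:label": "New York",
                    "wgs84_pos:lat": "40", "wgs84_pos:long": "74"},
}
labeled = {"dbpedia:Beijing", "geo:1816670", "semweb:Beijing"}

lat_only = find_coreferent("wgs84_pos:lat", labeled, descriptions)
lat_and_long = find_coreferent("wgs84_pos:lat", labeled, descriptions,
                               combos=[{"wgs84_pos:lat", "wgs84_pos:long"}])
```

With latitude alone, ex:New_York is returned as a false match; once the combination adds longitude with its most common training value "116", no candidate remains.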
¹ The act of retrieving a representation of a resource identified
by a URI is referred to as dereferencing that URI [15].