Finding the Achilles Heel of the Web of Data : using network analysis for link-recommendation
Abstract
The Web of Data is increasingly becoming an important infrastructure for such diverse sectors as entertainment, government, e-commerce and science. As a result, the robustness of this Web of Data is now crucial. Prior studies show that the Web of Data is strongly dependent on a small number of central hubs, making it highly vulnerable to single points of failure. In this paper, we present concepts and algorithms to analyse and repair the brittleness of the Web of Data. We apply these on a substantial subset of it, the 2010 Billion Triple Challenge dataset. We first distinguish the physical structure of the Web of Data from its semantic structure. For both of these structures, we then calculate their robustness, taking betweenness centrality as a robustness-measure. To the best of our knowledge, this is the first time that such robustness-indicators have been calculated for the Web of Data. Finally, we determine which links should be added to the Web of Data in order to improve its robustness most effectively. We are able to determine such links by interpreting the question as a very large optimisation problem and deploying an evolutionary algorithm to solve this problem. We believe that with this work, we offer an effective method to analyse and improve the most important structure that the Semantic Web community has constructed to date.
Christophe Gueret, Paul Groth, Frank van Harmelen, Stefan Schlobach
{cgueret,pgroth,Frank.van.Harmelen,schlobac}@few.vu.nl
VU University Amsterdam
De Boelelaan 1081a, 1081 HV, Amsterdam, The Netherlands
1 Introduction
The rapidly growing Web of Data increasingly resembles the Web in its network
properties. It resembles a small world network that relies on central hubs to
provide connectivity between resources on the Web of Data [10]. Such central
hubs are potential points of failure. This is particularly dangerous for the Web
of Data, which, unlike the Web, is designed to be used by automated agents that
have less capability to recover from lack of access to resources than human users
might have on the regular Web.
Current approaches to ensure robustness of the Web of Data are based on
anecdotal observations. In this work, we propose a systematic approach for
analysing the Web of Data and recommending where links can be added to
help ensure robustness against both infrastructure failure and semantic
deviation. An example of the first is: how can we ensure that automated agents can
SIOC ontology is updated, where do links need to be introduced to re-establish
connectivity?
Our systematic approach uses well known network properties to characterise
the robustness of both the infrastructure and semantic networks within the Web
of Data. Based on these properties, we present an optimisation algorithm that
produces recommendations about where links should be added to the Web of
Data. The algorithm takes into account whether additional links would be se-
mantically meaningful.
The contributions of this paper are (i) a characterisation of the strength of
the current Web of Data in terms of its infrastructure and semantic network;
(ii) a recommendation algorithm for adding links to the Web of Data to increase
its robustness; and (iii) applying this algorithm in order to determine how many
(and which) links are required to obtain different levels of robustness.
Our main findings are that (a) the current Web of Data is indeed highly
sensitive to failure of individual nodes, both at the infrastructure level and as a
semantic network, and (b) this situation can be remedied by adding a surprisingly
small number of links, provided that these links are chosen well, as calculated
by our recommendation algorithm.
The paper is organised as follows. In Section 2, we discuss related work and
argue why it is useful to distinguish infrastructural connectivity and semantic
connectivity. This leads to Section 3 where the robustness of the current Web
of Data is measured, followed by Section 4, which presents an algorithm to
recommend how best to increase that robustness. Section 5 concludes.
2 Background
2.1 Related Work
The use of network properties to study complex systems has grown in a wide
range of fields (e.g. biology, social science and web science) because it provides a
mechanism to extract global properties of systems [12]. In terms of robustness,
the classic result is from Barabasi, which shows that scale-free networks are
robust against random failure, but not against targeted attacks [1]. The robust-
ness of scale-free networks is important because they are widely seen in nature
including power grids, the World Wide Web and social networks [2].
The application of such network analysis to the Web of Data has until now
been limited, and has been performed on a wide variety of graph-structures: [10]
analysed the 2009 BTC dataset1 and showed that, interpreted as a sample of
the Web of Data, it is scale-free and that semanticdesktop.org and purl.org
are central in it. The same paper also analysed the well-known "bubble-graph"
of the Web of Data, consisting of the datasets published and interlinked by the
Linking Open Data project2. It showed the existence of topic-oriented hubs,
the shortest paths in the graph being routed through either DBPedia or DBLP.
1 http://vmlion25.deri.ie/index.html
2 http://esw.w3.org/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
In recent work, [8] analysed the "object link graph": the Web of Data
restricted to its object-to-object links, i.e. after removing all links from objects to
classes, and all class- and property-hierarchies. They found that this object-link
graph also has a scale-free nature, with a diameter value of 12, which is small
compared to the size of the graph, although the link density is rather low. Such
a small diameter of a large but low density graph again points to the presence
of central hubs that provide the main connectivity between many resources.
Other work, such as [9], also uses network analysis tools, but applies them
only to networks of ontologies, and do not consider the much more substantial
collection of instances that form the real content of the Web of Data. At an
even smaller scale, [13] applies concepts from network analysis to individual
ontologies.
Summarising, only a handful of analyses have been performed on the network
properties of the Web of Data. Furthermore, all these works have only analysed
the Web of Data, but nobody has used the results of their analysis to effectively
compute improvements to the Web of Data.
2.2 Infrastructure failure and semantic failure
Connectivity on the Web of Data can be disrupted in two different ways:
infrastructural failure or semantic failure. For the infrastructure, the problem is server
unavailability, e.g. the dbpedia.org server is down. In the semantic network, the
problem is robustness against change, for example still using sioc:User instead
of the current sioc:UserAccount.
The robustness of an infrastructure is commonly improved by the use of
mirrors and caches. Our approach is complementary to using these techniques. In
order to detect hosts that function as infrastructure hubs, and whose unavail-
ability would hence break many paths, we aggregate the Web of Data into a
hostname graph:
Definition 1 (hostnames graph). The hostname graph $H$ is a tuple $\langle V, E \rangle$
where $h \in V$ is a node of $H$ iff $h$ is used as a hostname in any URI on the Web of
Data, and $e \in E$, $e = \langle h_1, h_2 \rangle$, is an edge of $H$ from node $h_1$ to
node $h_2$ iff there is a triple $\langle s, p, o \rangle$ anywhere on the Web of Data with
$h_1$ the hostname referred to in the URI of $s$ and $h_2$ the hostname referred to in
the URI of $o$.
Thus, the hostname graph has as many nodes as there are hostnames mentioned
in all the triples on the Web of Data.
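The aggregation in Definition 1 can be sketched in a few lines of Python. This is a minimal illustration, not the program used for the analysis: the function names `hostname` and `build_hostname_graph` and the toy triples are assumptions made for the example, and triples are assumed to be tuples of three URI strings.

```python
from urllib.parse import urlsplit


def hostname(uri):
    """Return the lower-cased hostname of a URI, or None if it has none."""
    return urlsplit(uri).hostname


def build_hostname_graph(triples):
    """Aggregate (s, p, o) URI triples into a hostname graph (V, E)."""
    nodes, edges = set(), set()
    for s, p, o in triples:
        # Every hostname used in any URI becomes a node.
        for uri in (s, p, o):
            h = hostname(uri)
            if h is not None:
                nodes.add(h)
        # An edge links the hostname of the subject to that of the object.
        h1, h2 = hostname(s), hostname(o)
        if h1 is not None and h2 is not None:
            edges.add((h1, h2))
    return nodes, edges


# Toy triples, for illustration only (not taken from the BTC dataset).
triples = [
    ("http://dbpedia.org/resource/Amsterdam",
     "http://xmlns.com/foaf/0.1/page",
     "http://example.org/amsterdam"),
    ("http://dbpedia.org/resource/Amsterdam",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Amsterdam_(city)"),
]
V, E = build_hostname_graph(triples)
```

Edges are stored as a set, so many triples between the same pair of hosts collapse into a single edge, which is what makes the hostname graph so much smaller than the underlying set of triples.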
Similarly, the namespace graph is an aggregation of the semantic structure
of the Web of Data:
Definition 2 (namespaces graph). The namespace graph $S$ is a tuple $\langle V, E \rangle$
where $n \in V$ is a node of $S$ iff $n$ is used as a namespace anywhere on the Web of
Data, and $e \in E$, $e = \langle n_1, n_2 \rangle$, is an edge of $S$ from node $n_1$ to
node $n_2$ iff there is a triple $\langle s, p, o \rangle$ anywhere on the Web of Data with
$n_1$ the namespace of $s$ and $n_2$ the namespace of $o$.
Thus, the namespace graph has as many nodes as there are namespaces mentioned
in all the triples on the Web of Data.
Definition 3 (content of nodes). The content $\mathrm{cont}(n)$ of a node $n$ is defined
as the set of URIs $r$ such that there is a triple $\langle r, p, o \rangle$ anywhere on the
Web of Data and $n$ is the namespace of $r$ for a namespace graph or $n$ is the hostname
of $r$ for a hostname graph.
3 Analysing the Web of Data
The networks and the programs described in this section are all publicly available
at http://linkeddata.few.vu.nl/wod_analysis/.
3.1 Measures of Robustness
By robustness of a graph, we mean the degree to which connectivity in a graph
is maintained after a node is removed from the graph. There are a number of
network measures that can be used for measuring the robustness of a graph. For
example, the diameter of a graph3 provides information about connectivity. A
smaller diameter implies that there are a large number of connections within
the network while a larger diameter means that the network is less connected.
While the diameter provides a reasonable global summary statistic, centrality
statistics allow one to investigate the graph on a per node basis. In particular,
betweenness centrality measures how often a node occurs on a shortest path between any
pair of nodes:
Definition 4 (Betweenness centrality). For a graph $G = (N, E)$ with a set
of nodes $N$ and a set of edges $E$, the betweenness centrality $B(n)$ of a node
$n \in N$ is defined as
$$B(n) = \sum_{s \neq n \neq t \in N} \frac{S(s, n, t)}{S(s, t)}$$
where $S(s, t)$ is the number of shortest paths from $s$ to $t$, and $S(s, n, t)$ is the
number of shortest paths from $s$ to $t$ that pass through node $n$. Instead of $B(n)$
we will often report on its non-normalised version, $B'(n)$:
$$B'(n) = \sum_{s \neq n \neq t \in N} S(s, n, t)$$
Instead of "betweenness centrality" we will often simply speak of "betweenness".
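To make Definition 4 concrete, the following sketch computes both $B(n)$ and $B'(n)$ directly from the definition on a small directed graph. It is a brute-force illustration only (cubic in the number of nodes) and is not the software used in this paper; the function names and the toy "diamond" graph are assumptions made for the example. It relies on the fact that a node $n$ lies on a shortest path from $s$ to $t$ exactly when $d(s, n) + d(n, t) = d(s, t)$, in which case $S(s, n, t) = S(s, n) \cdot S(n, t)$.

```python
from collections import deque


def count_shortest_paths(adj, source):
    """BFS from source; sigma[v] is the number of shortest paths source -> v."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in dist:            # first time w is reached
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:   # v is a predecessor of w
                sigma[w] += sigma[v]
    return dist, sigma


def betweenness(adj, nodes):
    """Return (B, B_prime) as defined in Definition 4, by brute force."""
    bfs = {s: count_shortest_paths(adj, s) for s in nodes}
    B = {n: 0.0 for n in nodes}
    B_prime = {n: 0 for n in nodes}
    for s in nodes:
        dist_s, sigma_s = bfs[s]
        for t in nodes:
            if t == s or t not in dist_s:
                continue                 # t unreachable from s
            for n in nodes:
                if n == s or n == t or n not in dist_s:
                    continue
                dist_n, sigma_n = bfs[n]
                # n is on a shortest s-t path iff the distances add up.
                if t in dist_n and dist_s[n] + dist_n[t] == dist_s[t]:
                    through = sigma_s[n] * sigma_n[t]   # S(s, n, t)
                    B[n] += through / sigma_s[t]
                    B_prime[n] += through
    return B, B_prime


# Toy "diamond" graph: two shortest paths from a to d, via b and via c.
adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
B, B_prime = betweenness(adj, ["a", "b", "c", "d"])
```

In the diamond, `b` and `c` each carry one of the two shortest paths from `a` to `d`, so each gets $B = 0.5$ and $B' = 1$, while the endpoints `a` and `d` get zero.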
Betweenness is a measure of the importance of a node for the connectivity
between other nodes. The intuition is that if a node lies on many shortest paths
it is an important node, since removal of such a node will directly influence
3 The diameter of a graph is the longest shortest path in the graph.
paths will have to be followed.4
A completely connected network has the maximal robustness, and correspondingly
the lowest betweenness centrality: $B(n) = 0$ for every $n \in N$, and removing
one node does not impact the overall connectivity of the network greatly.
If we want to improve the robustness of the Web of Data, we will want to
lower the number of nodes that have high betweenness centrality, since these
are important potential points of failure. For this, we will first need to analyse
which nodes actually have a high betweenness centrality. This is obviously
computationally intensive, since it involves calculating the shortest paths between
all pairs of nodes on the Web of Data. This robustness analysis will be the topic of
the remainder of this section. Deciding how to improve the robustness will be
tackled in Section 4.
3.2 Dataset
The 2010 Billion Triple Challenge (BTC) Dataset5 was used as a representative
sample of the Web of Data. It contains roughly 3.2 billion statements. From this
dataset, the hostname graph and namespace graph were constructed. Given that
namespaces cannot be systematically identified given a URL alone, we used a
predefined list of widely used namespaces as defined by the prefix.cc service.
Out of the 330 namespaces registered on the service, 198 were found to be used
in the snapshot used to create the networks.
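The lookup of a namespace from a predefined list can be sketched as follows. The text does not say how a URI is matched against the list, so the longest-prefix rule used here is an assumption, as are the function names and the short `KNOWN_NAMESPACES` list, which only stands in for the list obtained from prefix.cc.

```python
# Illustrative stand-in for the namespace list (assumption, not the real list).
KNOWN_NAMESPACES = [
    "http://xmlns.com/foaf/0.1/",
    "http://www.w3.org/2002/07/owl#",
    "http://dbpedia.org/",
    "http://dbpedia.org/resource/",
]


def namespace_of(uri, namespaces=KNOWN_NAMESPACES):
    """Return the longest known namespace that prefixes uri, or None."""
    matches = [ns for ns in namespaces if uri.startswith(ns)]
    return max(matches, key=len) if matches else None


def build_namespace_graph(triples, namespaces=KNOWN_NAMESPACES):
    """Aggregate (s, p, o) URI triples into a namespace graph (V, E)."""
    nodes, edges = set(), set()
    for s, p, o in triples:
        for uri in (s, p, o):
            ns = namespace_of(uri, namespaces)
            if ns is not None:
                nodes.add(ns)
        n1, n2 = namespace_of(s, namespaces), namespace_of(o, namespaces)
        # Triples whose subject or object is outside the list add no edge.
        if n1 is not None and n2 is not None:
            edges.add((n1, n2))
    return nodes, edges


triples = [
    ("http://dbpedia.org/resource/Amsterdam",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://xmlns.com/foaf/0.1/Document"),
]
V, E = build_namespace_graph(triples)
```

Triples whose subject or object falls outside the list contribute no edge, which is why the coverage of the list matters for how representative the resulting graph is.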
We removed from the BTC all triples where the object was a literal, all
triples containing blank nodes, and all triples that refer only to URIs from the
same dataset, since none of these triples would contribute to the objects of our
study, namely the hostname graph and the namespace graph. Surprisingly, this
reduced the BTC dataset to 530 million triples, showing that the vast majority
of the Web of Data (or at least the BTC snapshot of it) does not contribute to
it being a "web". Of those remaining 530 million triples, the vast majority (389
million) were covered by the namespace list built from prefix.cc. This gives
us some confidence that the namespace list is a sufficiently representative set of
namespaces for building our namespace graph.
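The filtering step can be sketched as a simple predicate over N-Triples-style terms. This is a hedged illustration: the term syntax (`<...>` for URIs, `_:` for blank nodes, a leading quote for literals) follows N-Triples, but treating "same dataset" as "same hostname" is an assumption made here for the example, and `keep_triple` is a hypothetical name.

```python
from urllib.parse import urlsplit


def is_uri(term):
    """True for an N-Triples URI term such as <http://example.org/x>."""
    return term.startswith("<") and term.endswith(">")


def host(term):
    """Hostname of a URI term (the angle brackets are stripped first)."""
    return urlsplit(term[1:-1]).hostname


def keep_triple(s, p, o):
    """Keep only triples that link two URIs on different hosts."""
    # Literal objects and blank nodes are not URI terms and are dropped.
    if not (is_uri(s) and is_uri(o)):
        return False
    # Assumption: "same dataset" is approximated by "same hostname".
    return host(s) != host(o)


triples = [
    ("<http://dbpedia.org/resource/A>", "<http://www.w3.org/2002/07/owl#sameAs>",
     "<http://rdf.freebase.com/ns/a>"),                       # kept
    ("<http://dbpedia.org/resource/A>", "<http://xmlns.com/foaf/0.1/name>",
     '"A"'),                                                  # literal object
    ("_:b0", "<http://xmlns.com/foaf/0.1/knows>",
     "<http://example.org/me>"),                              # blank node
    ("<http://dbpedia.org/resource/A>", "<http://dbpedia.org/ontology/x>",
     "<http://dbpedia.org/resource/B>"),                      # internal link
]
kept = [t for t in triples if keep_triple(*t)]
```

Only the first toy triple survives, mirroring the observation that most triples do not link one source to another.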
Network name    Number of nodes    Number of edges
Hostnames       558841             656012
Namespaces      198                936
Table 1. Size of the two studied networks
4 Of course, if we are interested in connectivity, it is only an approximation to
assume that connections only happen along shortest paths; variations of betweenness
centrality such as "flow betweenness" and "random walk betweenness" have been
proposed to allow for this. In many practical cases however, the simple (shortest
path) betweenness centrality gives quite informative answers [12].
5 http://km.aifb.kit.edu/projects/btc-2010/
Figure 1 shows the degree distribution of both the hostname graph (infrastructure)
and the namespace graph (semantic links). Both distributions exhibit a pattern that
is not linear. The degree distribution does not follow a power law. From this we can
conclude that these two networks are not scale-free. However, they still have a
few strongly connected hubs.
(a) Namespaces (b) Hostnames
Fig. 1. Degree distribution of the namespaces and hostnames networks
3.3 Robustness Results
Based on the extracted graphs, we calculated the betweenness centrality for all
nodes in both graphs using the Small-world Network Analysis and Partitioning
software (SNAP) [4]. Given the size of the hostname graph, we used an approx-
imation algorithm implemented by SNAP and set the sampling percentage to
10% of all nodes. This is double the 5% suggested for use in [4]. For
more details on the algorithm used, see [3].
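The idea of approximating betweenness from a sample of nodes can be illustrated with a plain pivot-sampling variant of the standard shortest-path dependency accumulation. This is not the algorithm implemented in SNAP (see [3] for that); it is a simplified sketch with uniform sampling, and the function name, the `fraction` parameter and the toy graph are assumptions. With `fraction=1.0` it reduces to an exact computation of $B(n)$.

```python
import random
from collections import deque


def sampled_betweenness(adj, nodes, fraction=0.1, seed=0):
    """Estimate B(n) from BFS runs started at a random sample of source nodes.

    `nodes` must contain every node that appears in `adj`.
    """
    nodes = list(nodes)
    k = max(1, round(fraction * len(nodes)))
    pivots = random.Random(seed).sample(nodes, k)
    scale = len(nodes) / k               # extrapolate from k sources to all
    B = {v: 0.0 for v in nodes}
    for s in pivots:
        dist, sigma, preds, order = {s: 0}, {s: 1}, {s: []}, []
        queue = deque([s])
        while queue:                     # BFS counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):        # accumulate dependencies backwards
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                B[w] += scale * delta[w]
    return B


adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
exact = sampled_betweenness(adj, ["a", "b", "c", "d"], fraction=1.0)
```

Raising the sampling fraction trades running time for accuracy, which is the trade-off behind choosing 10% rather than 5% of the nodes.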
Infrastructure Analysis Table 2(a) shows the non-normalised betweenness
distribution ($B'(n)$) among the hostnames on the Web of Data, in bins
starting from the maximal centrality and working down to zero. We note that
the distribution does not follow a power-law curve but is in fact more extreme:
essentially, almost all infrastructural connectivity on the Web of Data is
mediated by only 3 servers. Table 2(b) reveals which hosts these are: xmlns.com,
dbpedia.org and purl.org. All this points to an extreme brittleness of the
infrastructure underlying the Web of Data: only taking out a handful of servers
would completely cripple the entire network.
Figure 2 provides a good example of the potential impact that the dominance
of hubs could have on the Web of Data. Recently, Radar Networks which owned
twine.com to Evri was smooth, it is entirely possible that www.twine.com could
have ceased to exist or no longer supported Web of Data content as a result of
this takeover. Our analysis shows that this would have had a substantial impact
on the infrastructural connectivity of the Web of Data.
B'(n)               #Nodes
5-6 × 10^9          2
4-5 × 10^9          0
3-4 × 10^9          0
2-3 × 10^9          1
1-2 × 10^9          0
0.5-1 × 10^9        4053
0-0.5 × 10^9        554785
(a) Distribution of the betweenness results

Hostname            B'(n)
xmlns.com           5 693 379 049
dbpedia.org         5 432 125 038
purl.org            2 163 504 423
www.kanzaki.com       532 149 372
www.w3.org            470 113 796
dbtune.org            323 796 691
identi.ca             318 896 524
www.twine.com         299 237 555
semanticweb.org       277 374 029
dblp.l3s.de           225 602 575
(b) Top 10 hostnames and their betweenness result

Table 2. Histogram of betweenness for hostnames and the top ten hostnames with the
highest betweenness
The 554 785 hostnames with a betweenness of 0 are dead ends in the network.
Some of these hosts may be used to serve only non semantic content, such as
images. Thus, they do not provide resources that can be interlinked and used
to walk through the network. The 4056 other hosts are more representative of
the interlinkage status of the graph. This number is much higher than the 198
nodes in the namespaces network (these namespaces account for 60 different
hostnames).
Semantic Network analysis Similar to the infrastructure network analysis,
Table 3a shows the betweenness distribution of the namespaces, again arranged
in 10 bins. The majority of nodes are not in-between at all and the overall
distribution mirrors that of the hostnames graph. The semantic network of the
Web of Data, like its infrastructure network, also relies heavily on hubs. Table 3b
shows these hubs. These are indeed the hubs one would expect, perhaps with
the exception of example.org, which, by definition, can provide no connectivity
to other namespaces because it is reserved for examples8.
6 http://www.novaspivack.com/uncategorized/evri-ties-the-knot-with-twine
8 See RFC2606, http://www.rfc-editor.org/rfc/rfc2606.txt
B'(n)        #Nodes
8001-9000    1
7001-8000    1
6001-7000    0
5001-6000    2
4001-5000    0
3001-4000    1
2001-3000    0
1001-2000    6
1-1000       70
0            117
(a) Distribution of the betweenness results

Namespace                                                      B'(n)
www.w3.org/1999/02/22-rdf-syntax-ns#                           8783
example.org/                                                   7191
dbpedia.org/resource/                                          5428
xmlns.com/foaf/0.1/                                            5030
www.w3.org/2002/07/owl#                                        3926
sw.opencyc.org/concept/                                        1764
www.w3.org/2007/uwa/context/deliverycontext.owl#               1737
www.w3.org/2003/01/geo/wgs84_pos#                              1609
www.semanticdesktop.org/ontologies/2007/11/01/pimo#            1300
ontologies.ezweb.morfeo-project.org/eztag/ns#                  1225
(b) Top 10 namespaces and their betweenness result

Table 3. Histogram of betweenness for namespaces and the top ten namespaces with
the highest betweenness
4 Improving the Web of Data
The previous section has shown that the Web of Data is extremely brittle, and
relies on a very small number of hubs that are crucial to its connectivity. Both
the infrastructure network and the semantic network could be strengthened
by judiciously adding links to the network. The expected impact of such new links
is to reduce the variation of the centrality among the nodes of a graph, thereby
diminishing the importance of hubs. The variation of betweenness centrality
within a graph is termed the centralisation betweenness index [7]:
Definition 5 (centralisation betweenness index). Given a graph $G = (N, E)$
with a set of nodes $N$ and a set of edges $E$, the centralisation betweenness
index $C(G)$ of $G$ is defined as
$$C(G) = \sum_{i=1}^{|N|} \frac{\max_{n \in N} B(n) - B(i)}{|N| - 1}$$
where $B(n)$ is the betweenness of node $n$ in the graph.
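Given betweenness values for all nodes, the index of Definition 5 is a one-line aggregate. The sketch below assumes the values are supplied as a dictionary from node to betweenness; the function name and the toy values are illustrative only.

```python
def centralisation_index(betweenness):
    """Sum of (max betweenness - node betweenness), divided by |N| - 1."""
    values = list(betweenness.values())
    if len(values) < 2:
        return 0.0                       # the index is undefined for |N| < 2
    b_max = max(values)
    return sum(b_max - b for b in values) / (len(values) - 1)


# One dominant hub versus a perfectly even distribution (toy values).
hub_graph = {"hub": 6.0, "a": 0.0, "b": 0.0, "c": 0.0}
even_graph = {"a": 2.0, "b": 2.0, "c": 2.0, "d": 2.0}
c_hub = centralisation_index(hub_graph)
c_even = centralisation_index(even_graph)
```

The index is zero when all nodes have the same betweenness and grows with the gap between the most central node and the rest, so lowering it corresponds to diminishing the importance of hubs.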
4.1 The cost of fixing the WoD
The simplest way of reducing C(G) would be to make G a fully connected graph,
resulting in an optimal value of C(G) = 0. Of course, for the Web of Data this is
neither feasible nor desirable, because only semantically meaningful links should
be added. Besides, the creation of new edges has a cost. As is well known from
the ontology mapping domain, establishing new relations between two ontologies
sameAs triple is challenging.
We have therefore chosen to characterise the problem of recommending where
to introduce edges in the Web of Data as an optimisation problem that minimises
the centralisation index C(G) while at the same time minimising the cost of
introducing an edge.
In the following, we estimate the cost of adding an edge as the inverse of
the overlap between the used vocabularies. This estimates the chances of finding
pairs of concepts or resources based on the shared usage of predicates by the
respective nodes. Intuitively, this cost measure favours "meaningful" edges, i.e.
edges between nodes with overlapping vocabularies. Of course, this is a very
rough estimation, which could be exchanged for a more accurate one without
impairing the applicability of our algorithms.
Definition 6 (vocabulary of a node). The vocabulary of a node $n$ from either
a hostnames graph $H$ or a namespaces graph $S$ is the set of predicates used to
describe the resources contained in the node:
$$\mathrm{vocab}(n) = \{\, p \mid \exists \langle r, p, o \rangle,\ r \in \mathrm{cont}(n) \,\}$$
Our semantic cost for a link between two nodes will be based on the similarity
of the vocabularies used in the nodes. We used the standard Jaccard measure to
quantify the similarity between vocabularies. This is a measure commonly used
in the ontology mapping domain.
Definition 7 (Vocabulary Similarity). The similarity $S(n_1, n_2)$ between two
nodes $n_1$ and $n_2$ from either the hostname graph or the namespace graph is
defined as:
$$S(n_1, n_2) = \frac{|\mathrm{vocab}(n_1) \cap \mathrm{vocab}(n_2)|}{|\mathrm{vocab}(n_1) \cup \mathrm{vocab}(n_2)|}$$
The corresponding cost of the edge $\langle n_1, n_2 \rangle$ is defined as the
complement of the similarity between the nodes:
$$\mathrm{cost}(\langle n_1, n_2 \rangle) = 1 - S(n_1, n_2)$$
Of course, we could use any other measure for semantic overlap from work in on-
tology alignment [6], and again these could be easily plugged into the algorithms
we will describe next.
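Definitions 6 and 7 translate directly into set operations. The sketch below is illustrative: `node_of` is a hypothetical callback that maps a URI to its node (hostname or namespace), the predicates are shortened to prefixed names for readability, and returning a similarity of 0 when both vocabularies are empty is an assumption, since the definition leaves that case open.

```python
def vocabulary(node, triples, node_of):
    """Predicates used to describe the resources contained in `node`."""
    return {p for s, p, o in triples if node_of(s) == node}


def similarity(vocab1, vocab2):
    """Jaccard similarity of two vocabularies."""
    union = vocab1 | vocab2
    # Assumption: two empty vocabularies are treated as fully dissimilar.
    return len(vocab1 & vocab2) / len(union) if union else 0.0


def edge_cost(vocab1, vocab2):
    """Cost of an edge: the complement of the vocabulary similarity."""
    return 1.0 - similarity(vocab1, vocab2)


# Toy data; the node of a URI is taken to be its host part here.
triples = [
    ("http://a.example/x", "foaf:name", "http://lit.example/1"),
    ("http://a.example/x", "foaf:knows", "http://b.example/y"),
    ("http://b.example/y", "foaf:knows", "http://a.example/x"),
    ("http://b.example/y", "dc:creator", "http://a.example/x"),
]
node_of = lambda uri: uri.split("/")[2]
vocab_a = vocabulary("a.example", triples, node_of)
vocab_b = vocabulary("b.example", triples, node_of)
cost_ab = edge_cost(vocab_a, vocab_b)
```

Here the two nodes share one predicate out of three used overall, giving a similarity of 1/3 and a cost of 2/3.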
Using these calculations as our basis we now define the optimisation problem
as follows:
$$\text{minimize } C(\langle N, E' \rangle) \text{ subject to } \min \sum_{e \in E'} \mathrm{cost}(e), \text{ where } E' = E \cup (N \times N)$$
Note that $E'$ is the union of the existing edges with some set of newly
introduced edges from the space of all possible edges in the graph.
already connecting these islands. The algorithm implementing this strategy is
similar to the previous one and has the same scalability constraint.
Selective Strategies
Choose randomly Rather than focusing on the cheapest or the most expensive
edges, it could be interesting to select a sample of X of them with different costs.
The expected result is a mix of bridging some clusters and increasing the density
of others. The most straightforward approach is then to randomly select
the set of edges to create.
The algorithm implementing this strategy simply creates a set of new edges
by sampling two random values between 1 and n. If the drawn edge is already
present in the graph or in the set of edges to add, the process is repeated.
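The random selection can be sketched as rejection sampling over pairs of node indices. The function name and parameters are illustrative, and excluding self-loops is an assumption made for the example.

```python
import random


def random_edge_set(nodes, existing_edges, size, seed=None):
    """Draw `size` distinct new edges that are not yet in the graph."""
    rng = random.Random(seed)
    nodes = list(nodes)
    n = len(nodes)
    # Guard against asking for more new edges than can exist.
    free = n * (n - 1) - sum(1 for a, b in existing_edges if a != b)
    if size > free:
        raise ValueError("not enough free edges")
    chosen = set()
    while len(chosen) < size:
        i, j = rng.randrange(n), rng.randrange(n)
        edge = (nodes[i], nodes[j])
        # Redraw on self-loops (assumption) and on edges already present.
        if i == j or edge in existing_edges or edge in chosen:
            continue
        chosen.add(edge)
    return chosen


new_edges = random_edge_set(["a", "b", "c"], {("a", "b")}, size=5, seed=42)
```

Rejection sampling stays cheap as long as the graph is sparse, because a random pair is then very unlikely to hit an existing edge.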
Choose wisely This last strategy accounts for a property ignored by all other
strategies: the fact that some edges could be nice to add in combination with
others. Indeed, the centrality gain is likely not to depend only on how many new
edges are created but also on which ones. The idea then is not to only select the
edges to add one by one but to focus on a group of edges of size X, all at once.
Instead of creating only one set of edges like in the random selection, several
sets are investigated in parallel and iteratively improved. This search strategy
is done by an evolutionary algorithm, a population based class of algorithm
known to perform well on combinatorial optimisation problems [5]. The outline
of the evolutionary algorithm, a standard one, is detailed in Algorithm 1. It is
a generational evolution with an elitism of 1: every new generation replaces the
previous set of candidates with the exception of the best one which is kept.
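A compact sketch of such a generational evolutionary algorithm with an elitism of 1 is given below. It is a reading of the description above rather than the implementation used for the experiments: the population size, the number of generations, the binary tournament and the single cut point crossover are assumptions, the probabilities 0.1 and 0.8 are those of Algorithm 1, and `fitness` is a caller-supplied function to be minimised (for instance the centrality ratio of the graph after adding the candidate edges).

```python
import random


def evolve(nodes, existing_edges, set_size, fitness,
           pop_size=20, generations=50, seed=0):
    """Search for a set of `set_size` new edges with minimal fitness."""
    rng = random.Random(seed)
    nodes = list(nodes)

    def new_edge():
        while True:
            a, b = rng.sample(nodes, 2)          # two distinct nodes
            if (a, b) not in existing_edges:
                return (a, b)

    def tournament(scored):
        x, y = rng.sample(scored, 2)             # binary tournament (assumed)
        return list(min(x, y, key=lambda pair: pair[0])[1])

    # A candidate is a list of edges; it may contain the same edge twice.
    population = [[new_edge() for _ in range(set_size)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        scored = [(fitness(c), c) for c in population]
        best = min(scored, key=lambda pair: pair[0])[1]
        next_population = [best]                 # elitism of 1
        while len(next_population) < pop_size:
            r = rng.random()
            if r < 0.1:                          # crossover
                s, s2 = tournament(scored), tournament(scored)
                cut = rng.randrange(1, set_size) if set_size > 1 else 0
                next_population.append(s[:cut] + s2[cut:])
            elif r < 0.9:                        # mutation, edge by edge
                s = tournament(scored)
                next_population.append(
                    [new_edge() if rng.random() < 0.1 else e for e in s])
            # otherwise: no candidate is produced and the loop draws again
        population = next_population
    return min(population, key=fitness)


# Toy fitness for demonstration only: prefer edges that point to node "a".
def toy_fitness(edge_set):
    return sum(1 for _, target in edge_set if target != "a")


best = evolve(["a", "b", "c", "d", "e"], {("a", "b")}, set_size=3,
              fitness=toy_fitness, pop_size=10, generations=20)
```

Because the best candidate is always copied into the next generation, the best fitness in the population can never get worse from one generation to the next.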
4.3 Repair of the namespaces network
The namespaces network contains 198 nodes and 936 edges, leaving room for
198 × 197 − 936 = 38070 new edges. Figures 3 and 4 report the results of
the previously introduced strategies on that network.
The two greedy strategies are compared in Figure 3. It can be observed that
neither of these baselines performs very well, in two respects: (1) many links must be
added before obtaining a reasonable improvement of the centrality: 2500 links
have to be added to halve the centrality; (2) both strategies first create more
damage than improvement: the centrality first increases before going down
again. Also, the decrease is monotonic only after a minimum number of edges has been
added, meaning that these strategies are only applicable if a minimum amount of
resources is available. There is however a clear winner in this picture: adding
edges by increasing cost is the best approach, damaging the network less and
decreasing its centrality starting at 125 edges. It can thus be concluded that
focusing on the easiest pairs is the best idea when one cannot do better and X is
large enough.
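The ordering used by the two greedy baselines (all candidate edges sorted by cost and inserted one by one, as in Figure 3) can be sketched as follows. The function name, the `cost` callback and the toy costs are illustrative assumptions.

```python
from itertools import permutations


def greedy_candidates(nodes, existing_edges, cost, increasing=True):
    """All possible new edges, sorted by increasing or decreasing cost."""
    candidates = [edge for edge in permutations(nodes, 2)
                  if edge not in existing_edges]
    return sorted(candidates, key=cost, reverse=not increasing)


# Toy costs for the five possible new edges of a three-node graph.
toy_costs = {
    ("a", "c"): 0.2, ("b", "a"): 0.9, ("b", "c"): 0.5,
    ("c", "a"): 1.0, ("c", "b"): 0.7,
}
cheapest_first = greedy_candidates(["a", "b", "c"], {("a", "b")},
                                   cost=toy_costs.get)
costliest_first = greedy_candidates(["a", "b", "c"], {("a", "b")},
                                    cost=toy_costs.get, increasing=False)
first_two_to_add = cheapest_first[:2]
```

Enumerating and sorting every candidate is quadratic in the number of nodes, which is manageable for the 198 namespaces but not for graphs with hundreds of thousands of nodes.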
point crossover" operation that mixes two candidate solutions.
Initialise population P;
while not terminated do
    /* Evaluation of current sets */
    foreach candidate set of edges s in P do
        compute C_B(<N, E ∪ s>) / C_B(<N, E>)
    /* Creation of new sets */
    P' ← best individual from P;
    while size of P' different from size of P do
        switch with a probability of 0.1 do
            s ← tournament selection from P;
            s' ← tournament selection from P;
            P' ← P' ∪ crossover(s, s')
        switch with a probability of 0.8 do
            s ← tournament selection from P;
            foreach edge s_i of s do
                switch with a probability of 0.1 do
                    s_i ← randomly created new edge
            P' ← P' ∪ s;
    /* Generation replacement */
    P ← P';
[Figure 3 plot. Axes: centrality ratio; number of edges added to the graph. Series: target, increasing cost, decreasing cost.]
Fig. 3. Comparison of the two greedy strategies that consist in sorting all the edges
according to their cost and inserting them one by one, by (in/de)creasing cost.
gies. The results from the two selective strategies are reported in Figure 4. Our
first observation is that both strategies outperform the greedy approaches: they
are less damaging and reduce centrality faster. The random choice technique has
some uncertain behaviour when fewer than 250 edges are added but is guaranteed
to decrease the centrality by almost 60% if at least 1000 edges are created (i.e.
about 2.6% of the 38070 possible new edges). Both algorithms monotonically
improve the centrality as soon as more than 250 edges are added. That is roughly
27% of the existing 936 edges. Above 10000 new edges, there is no difference in
the results. For fewer than 250 new edges, the evolutionary algorithm finds the
best sets. It achieves the best performance, decreasing the centrality by almost
60%, with a set of only 64 edges.
[Figure 4 plot. Axes: centrality ratio; size of the set of edges added. Series: target, random choice, evolutionary algorithm.]
Fig. 4. Comparison of the two selective strategies applied to the namespaces network.
They consist in creating a set of edges to add, either by random choice or iterative
construction (evolutionary algorithm). The goal is to bring the ratio, at least, below
1.0 and, at best, close to 0.
Table 4 shows the four links recommended to create in order to decrease
the centrality of the network by 30%. We now discuss whether the addition of
the suggested links is feasible. Row 1 suggests creating a link from the
Lifecycle Schema to Freebase. The Lifecycle Schema describes the specification of
a generic lifecycle for a resource. It defines notions such as state, transition and
task. Links could easily be created from this schema to descriptions of the
corresponding concepts in Freebase. For example, one could link to the definition of
Finite-state machine in Freebase (i.e. http://rdf.freebase.com/ns/finite_state_machine).
Row 2 recommends creating a link between annotations about
papers from ISWC 2004 to the Ubiquitous Applications Location Ontology. This
seems reasonable since one could describe the papers as having been presented
note is that the given link for ISWC 2004 annotations is no longer operative. It
should probably be updated to the Semantic Web Dogfood site. This is another
example where old links cause the Web of Data to break. The third
recommendation, Row 3, suggests adding a link between a site describing labels for
about 1 million commodities to SKOS-XL (an ontology for describing labels).
A connection between these sites again seems reasonable as one could possibly
describe these commodity labels as subclasses of skosxl:Label. Finally, the
recommendation in Row 4, to link the Dublin Core types to the Cyc Ontology, also
could be done given that the Dublin Core types describe generic types such as
Event, Image, Sound, which also appear in Cyc.
    From namespace                                  To namespace                                        Cost
1   http://purl.org/vocab/lifecycle/schema#         http://rdf.freebase.com/ns/                         0.999803
2   http://annotation.semanticweb.org/2004/iswc#    http://www.w3.org/2007/uwa/context/location.owl#    0.892857
3   http://openean.kaufkauf.net/id/                 http://www.w3.org/2008/05/skos-xl#                  1.0
4   http://purl.org/dc/dcmitype/                    http://sw.opencyc.org/concept/                      1.0
Table 4. When added all together to the namespaces graph, these 4 edges bring the
centrality to 70% of its original value.
4.4 Repair of the hostnames network
The hostnames network contains 558784 nodes and 656012 edges, leaving room
for 558784 × 558784 − 656012 = 312238902644 (about 312 billion) new edges.
Unfortunately, such a huge number of edges makes search by enumeration impossible
and the greedy approaches inapplicable. Instead, we only apply the selective
strategies.
For the random strategy, as long as the number of edges added reaches 100M
(that is, 0.03% of the 312B possibilities), it does not matter which ones are added.
In every case, the centrality is diminished by at least 90%, going to 10% of the
original value. The same holds for the evolutionary strategy, which
performs slightly better than the random strategy. Unfortunately, when fewer than
100M edges are added, both strategies have a significant adverse impact on the
hostname network before any improvement is seen, and adding fewer than
10k edges has no impact.
5 Conclusion
We can divide the conclusions of this paper into two categories: (i) generic
methods for analysing the Web of Data, and (ii) specific observations on the
state of the current Web of Data.
- We have defined two useful abstractions over the Web of Data, the hostname
  graph and the namespace graph, allowing us to analyse both the infrastructural
  and the semantic connectivity of the Web of Data.
- Following insights from network analysis, we have proposed betweenness
  centrality as the key metric for measuring network robustness (= the ability
  to maintain connectivity after removal of nodes).
- We have phrased the problem of improving the robustness as an optimisation
  problem, aiming to minimise the graph's centrality index under minimal cost
  of adding links. We proposed as a cost-function the Jaccard distance measure
  based on vocabulary overlap, but our approach is neutral as to the choice of
  the cost-function.
- We investigated the feasibility of a number of algorithms to solve this
  optimisation problem, and showed that, in particular, the use of an evolutionary
  algorithm was successful in identifying a small number of links that
  substantially increase the robustness of the graph.
Observations on the state of the current Web of Data Assuming that
the BTC dataset is indeed a representative snapshot, the following facts have
been revealed by our analysis:
- The vast majority of triples on the Web of Data do not contribute to it being
a web, but instead point to literals or blank nodes, or refer only to URIs
internal to the same dataset. This concerns as much as 80% of all triples.
- The Web of Data is currently not a scale-free network. It shows a more
extreme distribution, although it has some of the typical properties of a
scale-free network, in particular the presence of hub-nodes.
- Almost all infrastructural connectivity on the WoD is mediated by 3 servers,
xmlns.com, dbpedia.org and purl.org, making the system very brittle.
- Similarly, almost all semantic connectivity is provided through a small
number of namespaces, again a very brittle structure.
- On the positive side, the robustness of the Web of Data can be improved
drastically: the centrality of the namespace graph can be improved by a
factor of 2 by adding just 4 edges to the namespace graph.
- For the hostnames graph, we were not able to find any such easy fixes. In
fact, it seems that the hostnames graph will need substantial (and hence
automated) extension for it to become more robust.
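The first observation and the hostname abstraction can be made concrete with a small sketch. It is an illustration under stated assumptions, not our processing pipeline: triples are plain Python tuples of strings, anything that is not an http(s) URI is treated as a literal or blank node, and an edge is recorded from the subject's host to the object's host whenever the two differ. The function names and the sample triples are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def hostname(term):
    """Host of an http(s) URI, or None for literals and blank nodes."""
    if term.startswith(("http://", "https://")):
        return urlparse(term).hostname
    return None


def hostname_graph(triples):
    """Return (edges, non_linking): weighted host-to-host edges, and the
    number of triples that do not link two different hosts."""
    edges = Counter()
    non_linking = 0
    for subject, _predicate, obj in triples:
        src, dst = hostname(subject), hostname(obj)
        if src is None or dst is None or src == dst:
            non_linking += 1          # literal, blank node or internal link
        else:
            edges[(src, dst)] += 1
    return edges, non_linking


if __name__ == "__main__":
    sample = [
        ("http://dbpedia.org/resource/Amsterdam",
         "http://www.w3.org/2000/01/rdf-schema#label", "Amsterdam"),
        ("http://dbpedia.org/resource/Amsterdam",
         "http://www.w3.org/2002/07/owl#sameAs",
         "http://sws.geonames.org/2759794/"),
        ("http://dbpedia.org/resource/Amsterdam",
         "http://dbpedia.org/ontology/country",
         "http://dbpedia.org/resource/Netherlands"),
        ("_:b0", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ]
    edges, non_linking = hostname_graph(sample)
    print(dict(edges), non_linking, "of", len(sample), "triples do not link hosts")
```

In this toy sample only one of four triples contributes an edge to the hostname graph, which mirrors in miniature the observation that most triples do not contribute to the Web of Data being a web.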
Future Work. A first task would of course be to extend this work to larger
snapshots of the Web of Data, to see if our methods scale and if our findings
generalise. Currently, the hostname graph is already at the limit of what is
computationally feasible for solving the link-optimisation problem. In particular,
repeatedly evaluating the centrality index of candidate graphs that are generated
by our evolutionary algorithm is very expensive. An incremental algorithm for
calculating the centrality index of a slightly modified graph would be helpful here.
[...] into a real-time monitoring engine that would constantly monitor the state of the
Web of Data, e.g. by taking as input a stream of modifications, and produce as
output a set of suggestions for useful links to add in order to maintain or improve
robustness. Unlike the regular Web, where failure is tolerated, the Web of Data
is meant for machine consumption, implying that it is more in need of constant
and machine-assisted upkeep. In this paper, we have provided the necessary
abstractions for such quality control, and we have shown that the Web of Data
in its current form has severe vulnerabilities. We have also proposed effective
algorithms for determining repairs. With these results, our paper opens the way
towards continuous and machine-assisted repairs to the Web of Data.
In some cases adding a link may be less expensive than deploying a mirror.
While studying the cost of adding links versus that of deploying mirrors goes
beyond the scope of this work, we plan to work on the automated identification
and connection to cached data.
References
1. Albert, R., Jeong, H., Barabasi, A.L.: Error and attack tolerance of complex
networks. Nature 406(6794), 378-382 (Jul 2000)
2. Amaral, L.A., Scala, A., Barthelemy, M., Stanley, H.E.: Classes of small-world
networks. Proceedings of the National Academy of Sciences of the USA 97(21),
11149-11152 (Oct 2000)
3. Bader, D., Kintali, S., Madduri, K., Mihail, M.: Approximating betweenness
centrality. Algorithms and Models for the Web-Graph pp. 124-137 (2007)
4. Bader, D., Madduri, K.: SNAP, Small-world Network Analysis and Partitioning: an
open-source parallel graph framework for the exploration of large-scale networks.
In: IEEE International Symposium on Parallel and Distributed Processing. pp. 1-12. IEEE (Apr 2008)
5. Eiben, A., Smith, J.: Introduction to evolutionary computing. Springer (2003)
6. Euzenat, J., Shvaiko, P.: Ontology matching. Springer-Verlag (2007)
7. Freeman, L.C.: A Set of Measures of Centrality Based on Betweenness. Sociometry
40(1), 35 (Mar 1977)
8. Ge, W., Chen, J., Qu, Y.: Object Link Structure in the Semantic Web. In:
Proceedings of the 7th Extended Semantic Web Conference. pp. 257-271 (2010)
9. Gil, R., Garcia, R.: Measuring the semantic web. In: Advances in Metadata
Research, Proceedings of MTSR'05. No. ISBN 1-58949-053-3, Rinton Press (2006)
10. Gueret, C., Wang, S., Schlobach, S.: The web of data is a complex system - first
insight into its multi-scale network properties. In: Proceedings of the European
Conference on Complex Systems (ECCS) (2010), to appear
11. Jaffri, A., Glaser, H., Millard, I.: URI identity management for semantic web data
integration and linkage. In: 3rd International Workshop On Scalable Semantic Web
Knowledge Base Systems. Springer (2007)
12. Newman, M.E.J.: The Structure and Function of Complex Networks. SIAM Review
45(2), 167-256 (Jan 2003)
13. Zhang, X., Cheng, G., Qu, Y.: Ontology summarization based on RDF sentence
graph. In: WWW '07: Proceedings of the 16th international conference on World
Wide Web. pp. 707-716. ACM, New York, NY, USA (2007)