eRDF: Live Discovery for the Web of Data
Christophe Guéret, Paul Groth, and Stefan Schlobach
Department of Artificial Intelligence, Vrije Universiteit Amsterdam, de Boelelaan
1081a, 1081HV Amsterdam, The Netherlands
Abstract. eRDF is an infrastructure for exploring the Web of Data
through evolutionary querying. The main idea is to employ the well-known
strength of evolutionary strategies to find good, though possibly
approximate, answers quickly. This allows us to discover relevant answers
to a user's information need in an anytime way. As the system is based
on the idea of guessing and verifying solutions, it does not require
complex joins, which implies that we can easily query distributed data-sets
(in our case live SPARQL endpoints); data can thus be both local and
distributed. This allows eRDF to scale: for example, our current system provides
access to the Billion Triple Challenge (BTC) data set plus several other
large datasets. Another important feature of our methodology is that it
is robust against complex SPARQL queries, which is a crucial feature
for discovery queries.
The basic functionality of our infrastructure is provided by a simple
exploratory SPARQL endpoint, which allows discovery queries over the
BTC and several other datasources. To show the potential of the
infrastructure we implemented a prototype web application, Like?, which
allows users to discover resources across the Web of Data.
1 Problem Description
The Web of Data is growing at an amazing rate as more and more data-sources
are being made available online in RDF, and linked. At the same time specialised
triple-stores, such as Virtuoso[9], OWLIM[1] or 4store[6], have matured into
powerful engines that can efficiently answer queries for a given schema over
static data sets of billions of RDF triples.
However, in many cases the schema is not known, nor is the precise nature
of the search query. As the name suggests, query engines are suitable for precise
querying, but necessarily fail when the task is more explorative, when the user
first needs to discover information. A second drawback of current approaches
is that static data sets are explored and queried rather than the actual data
sources themselves. It is acknowledged that the currently most convenient use
of Semantic Data is by querying collections of static data, which are often
outdated, instead of live discovery. This is due to the difficulty of joining results
from different engines in federated querying. Finally, given the open character
of the web, which is intrinsically incoherent, incomplete and incorrect, an
exploration engine for the Web of Data must be robust. We claim that the eRDF
infrastructure makes significant steps in these four areas: exploration, live-access,
decentralisation and robustness. We now discuss these areas in more detail before
discussing eRDF and its use in the Billion Triple Challenge (BTC).
Discovery queries. The paradigm shift on the WWW from browsing to search
was one of the critical elements for its success as it allowed users to find relevant
information without knowing its exact location in the network. In search, users
define their needs by providing keywords, often with the goal of finding relevant
information without having a specific information source in mind. While
semantic search engines, such as sig.ma, are beginning to provide search over the
Web of Data, there is still the need for new techniques to discover what data is
available, particularly for software agents. Indeed, generating queries for a given
data source usually requires extensive knowledge of that data-source in order to
produce reasonable results. By integrating an approximate component into the
query process, eRDF can aid discovery.
Anytime answers over live distributed data-sources. Many of the applications
based on the Web of Data do not use data sources directly, as federated queries
over live SPARQL endpoints are known to be extremely expensive, because known
optimizations (for example to deal with joins) do not work in the distributed
case. Instead, snapshots are taken at intervals, dumped into gigantic repositories
and made available in database style for querying. The effect is that the available
information is constantly outdated, not just the index (as in traditional search
engines), but even the data itself.
eRDF allows distributed queries over live data-sources as only very simple
unary queries are needed. Additionally, eRDF can issue all of its queries in
a fully parallel fashion. There is no theoretical restriction on the number of
data-sources, and their data-size only marginally increases individual response
time. Of course, increasing data-size in combination with a constant population
size will increase convergence time. However, given the anytime character of
evolutionary methods, good answers are still returned comparatively quickly.
This makes eRDF an interesting alternative for exploration and discovery for
the Web of Data.
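The join-free, parallel querying described above can be sketched as follows. This is only an illustration in Python under stated assumptions, not the eRDF code: the endpoints are modelled as in-memory sets of triples rather than live SPARQL services, and the names ENDPOINTS, ask_endpoint and ask_all are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: three toy "endpoints", each an in-memory set of (s, p, o) triples.
# A real deployment would send a SPARQL ASK query over HTTP instead.
ENDPOINTS = {
    "SE1": {("foo", "knows", "bar")},
    "SE2": {("bar", "knows", "baz")},
    "SE3": set(),
}


def ask_endpoint(triples, pattern):
    """Answer an ASK for one triple pattern; None acts as a wildcard."""
    return any(
        all(q is None or q == t for q, t in zip(pattern, triple))
        for triple in triples
    )


def ask_all(pattern, endpoints=ENDPOINTS):
    """Send the same single-pattern ASK to every endpoint in parallel.

    No join is needed: the pattern holds if at least one endpoint confirms it.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        answers = list(pool.map(lambda ts: ask_endpoint(ts, pattern),
                                endpoints.values()))
    return any(answers)
```

Because each query concerns a single triple pattern, adding an endpoint only adds one more independent request; no intermediate results have to be shipped between endpoints.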
Robustness. Although SPARQL has been developed as an RDF query
language for Web data, there is a discrepancy between the database-like query
formalism and the adaptive, open-world, incoherent and inconsistent character
of the Web of Data. Schemas are often unknown, and posing promising queries
requires explicit knowledge of the structure of the information. The effect of this
is that many good answers are missed as queries are simply not adequate for
certain information needs. eRDF does not extend SPARQL but relaxes some
semantic constraints if required by the application. This makes it more robust for
querying unknown information, which is essential for exploration and discovery.
2 The eRDF infrastructure at a glance
In [8,7] we introduced RDF query answering by evolutionary algorithms (eRDF).
The basic idea is simple: instead of indexing the triples and joining results of
ground queries, we guess a population of candidate solutions. Those are then
improved by the classical mutation operation guided by a fitness function which,
roughly speaking, calculates the distance of a candidate from being a solution. This
distance can simply be the number of invalid triples in our solution, or more
complex combinations of such simple metrics with user-defined similarity measures.¹
Based on such well-defined, and user-specified, notions of similarity eRDF
returns "perfect" answers if possible, and approximate answers if necessary, which
is exactly what is required for discovery queries.
The input to eRDF is a standard SPARQL query. Currently, we limit our
system to answering SELECT queries that use one or more WHERE clauses
of simple graph patterns. At the time of writing, simple FILTER expressions
assessing the equivalence of terms are also being implemented.
Let us consider a query with its set G of graph patterns, its set F of filter
constraints and its set $V = \{?v_1, \ldots, ?v_n\}$ of variables to instantiate. A solution
to that query is a mapping $\sigma : V \mapsto I \cup L \cup B$ associating to every variable of G
a URI, blank node or literal taken respectively from the sets I, B and L of all
the URIs, blank nodes and literals present in at least one of the endpoints. Note that
only an abstract representation of those sets is actually provided by the data
layer; they are not actually created.
We will later need the notion of a candidate solution to a SPARQL query
which is simply any mapping which assigns to every variable a node from one
of the graphs.
Evolutionary algorithm. An evolutionary algorithm is a population-based
heuristic. A set of candidate solutions is improved in a generational process. Our
proposed method makes use of the fact that we can rank approximate solutions
according to their similarity to a perfect answer, in order to pick the candidates
that we consider as offspring for new generations. First, let us describe the
general evolutionary algorithm.
The evolutionary algorithm presented in this paper considers a set of
candidate solutions $P = \{\sigma_i\}, i \in [1, |P|]$ as its "Population". During the iterative
optimisation process, the content of the population is improved by replacing all
but one candidate solution (the "individuals" of the population) by better ones.
Evolutionary loops usually consist of the following steps: create an initial
population, generate new individuals, select the best solutions to form the new
generation, and loop back to the generation of new individuals [5]. Within this loop,
several operators may be used to obtain different behaviours of the evolutionary
process. Our algorithm uses a (1,10)-ES evolutionary strategy[5], meaning that
at every generation 10 candidate solutions are produced and only the best one
survives. The generation of new individuals is driven by a local search heuristic:
every new candidate solution is a slightly altered version of the best solution
found in the previous generation.
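The (1,10)-ES loop just described can be sketched generically as below. This is a minimal Python illustration, not the ECJ-based implementation: the function name one_comma_lambda_es and its parameters are hypothetical, and it assumes textbook comma selection, in which the parent does not compete with its own offspring (the text does not say whether eRDF does the same).

```python
import random


def one_comma_lambda_es(initial, mutate, fitness, offspring=10,
                        generations=50, rng=None):
    """Run a (1,lambda)-ES and return the best candidate seen so far."""
    rng = rng or random.Random(0)
    parent = initial
    best_so_far = initial
    for _ in range(generations):
        # Every child is a slightly altered copy of the single parent.
        children = [mutate(parent, rng) for _ in range(offspring)]
        # Comma selection: only the best child survives as the next parent.
        parent = max(children, key=fitness)
        # Remember the best solution so it can be returned at any time.
        if fitness(parent) > fitness(best_so_far):
            best_so_far = parent
    return best_so_far
```

Keeping a separate best-so-far value is what gives the loop its anytime character: interrupting it after any generation still yields the best candidate found up to that point.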
¹ Arguably, different user needs and different data sets require different notions of
approximation, and our evolutionary querying paradigm of eRDF is particularly
suitable for integrating such different notions by combining user-specified similarity
measures within one querying paradigm.
Fig. 1 gives an overview of our general loop: from a user input we issue a
discovery SPARQL query to eRDF, for which we expect to get approximate answers.
Hereafter, we describe the individual steps in more detail.
[Figure 1 omitted. Recoverable labels: User Input, Discovery Query, Evolutionary loop (candidate solutions, Validate, Select best, create offspring; steps 1 to 4), Result bucket, Data Layer (Cache, SE1, SE2, SE3).]
Fig. 1. The flow of candidate solutions as defined by the evolutionary algorithm.
(1) Initialisation. The population is initialised with some candidate solutions.
Traditionally within the EA community, the candidate solutions are random
solutions created from the search space. We have chosen instead to initialise
the population with default solutions in which none of the variables is bound; this
results in fewer queries being issued.
(2) Validation. The validation step consists of the evaluation of the candidate
solutions. For every one of them, a quality score $\mathrm{fitness}(\sigma)$ is computed based
on the quality of the bindings it contains.
$$\mathrm{fitness}(\sigma) = \frac{1}{|V|} \sum_{?v_i \in V} \mathrm{reward}(?v_i)$$
Any kind of rewarding scheme can be used under the only constraint that
$\mathrm{reward}(?v) \in [0, 1]$. A value of 0 denotes a bad assignment, a value of 1 a perfect
one. Any intermediate value denotes a binding that partially fulfils the objective.
For instance, let us consider a rewarding scheme based on the quality of the
query graph instantiated with the candidate solution: for each triple a score
is calculated depending on whether it exists (at least partially) in one of the
data-stores or not. This reward scheme is currently used in our prototype.
Consider a graph pattern $g = \langle ?v_1, p, ?v_2 \rangle$ and the candidate solution
$\sigma : \{?v_1 = foo, ?v_2 = bar\}$. The reward given to $?v_1$ and $?v_2$ will be maximal if the
instantiation of the graph pattern is a valid triple. This reward will be 0 if that
generated triple does not exist. The reward for $?v_1$ is computed as:
$$\mathrm{reward}_g(?v_1) = \begin{cases} 1 & \text{if } \mathrm{ask}(\langle foo, p, bar \rangle) = \top \\ 0.5 & \text{else if } \mathrm{ask}(\langle foo, p, ?whatever \rangle) = \top \\ 0 & \text{otherwise} \end{cases}$$
The rewards $\mathrm{reward}_g(?v)$ received for every graph pattern are then averaged to
obtain a global reward $\mathrm{reward}(?v)$.
In these two equations, ask denotes a standard ASK query expressed
in SPARQL. That query is sent to the data layer, which in turn sends it to the
different SPARQL endpoints it is connected to. The variable "?whatever" is used
as a wildcard to test the partial validity of a triple.
As these ask queries are simple validation queries on a single graph pattern,
this step is very efficient. We also make use of a cache to further optimise this
verification.
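As an illustration of the validation step, the reward scheme and the fitness average can be sketched in Python as below. Several points are assumptions made for the sketch: the data layer is replaced by a small in-memory triple set (TRIPLES), terms are plain strings with variables starting with "?", the cache is modelled with functools.lru_cache, and the partial check for a variable in object position (wildcard on the subject) is a symmetric guess, since the text only spells out the case of ?v1.

```python
from functools import lru_cache

# Assumption: a tiny in-memory triple set stands in for the data layer.
TRIPLES = frozenset({("foo", "p", "bar")})


@lru_cache(maxsize=None)  # stand-in for the cache; its real design is not given
def ask(s, p, o):
    """ASK stand-in; None plays the role of the ?whatever wildcard."""
    return any(
        all(q is None or q == t for q, t in zip((s, p, o), triple))
        for triple in TRIPLES
    )


def is_var(term):
    return term.startswith("?")


def reward_in_pattern(var, pattern, binding):
    """reward_g(var): 1 for a valid triple, 0.5 for a partial match, else 0."""
    full = tuple(binding[t] if is_var(t) else t for t in pattern)
    if ask(*full):
        return 1.0
    # Keep the value of `var`, replace every other variable by a wildcard.
    partial = tuple(
        binding[t] if t == var else (None if is_var(t) else t)
        for t in pattern
    )
    return 0.5 if ask(*partial) else 0.0


def reward(var, patterns, binding):
    """Average reward_g(var) over the graph patterns that mention var."""
    scores = [reward_in_pattern(var, g, binding) for g in patterns if var in g]
    return sum(scores) / len(scores) if scores else 0.0


def fitness(patterns, binding):
    """Mean reward over all variables of the query."""
    variables = sorted({t for g in patterns for t in g if is_var(t)})
    return sum(reward(v, patterns, binding) for v in variables) / len(variables)
```

With this toy data and the single pattern ("?v1", "p", "?v2"), the binding ?v1 = foo, ?v2 = baz gives ?v1 a reward of 0.5 and ?v2 a reward of 0, so the fitness is 0.25.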
(3) Selection. The fitness value of the candidate solutions is used to rank them.
Considering two candidate solutions $\sigma$ and $\sigma'$, $\sigma$ is better than $\sigma'$ if
$\mathrm{fitness}(\sigma) > \mathrm{fitness}(\sigma')$. According to our (1,10)-ES selection strategy, all the candidate
solutions are sorted and the population is reduced to just one element, i.e. only
the best individual survives. That best individual is also copied to a result bucket,
waiting there to be fetched by the client. This buffering strategy allows us to
always return the best-so-far solution to the client.
A candidate solution that manages to stay the best one of the population for
5 consecutive generations is assumed to be (locally) optimal. When encountered,
such a solution is stored in a tabu list and the search continues in another part
of the search space.
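The selection step with its result bucket and tabu list might look like the following Python sketch. The class name Selector, the patience parameter and the convention of returning None to request a restart are all assumptions; the text does not specify how the search moves to another part of the search space.

```python
class Selector:
    """Pick the best candidate per generation and track stagnation."""

    def __init__(self, fitness, patience=5):
        self.fitness = fitness
        self.patience = patience      # generations before declaring an optimum
        self.tabu = set()             # solutions assumed to be local optima
        self.result_bucket = []       # best-so-far solutions for the client
        self._last = None
        self._streak = 0

    def step(self, candidates):
        """Return the surviving individual, or None to request a restart."""
        allowed = [c for c in candidates if c not in self.tabu]
        if not allowed:
            return None
        best = max(allowed, key=self.fitness)
        self.result_bucket.append(best)
        # Count how many consecutive generations the same solution has won.
        self._streak = self._streak + 1 if best == self._last else 1
        self._last = best
        if self._streak >= self.patience:
            # Assumed locally optimal: forbid it and restart elsewhere.
            self.tabu.add(best)
            self._last, self._streak = None, 0
            return None
        return best
```

Candidates must be hashable here (for example tuples or frozensets of bindings) so that they can be stored in the tabu set.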
(4) Create offspring. In this step, the best candidate solution found in the
previous step is modified with the hope of improving it. This candidate solution
is modified (mutated) 10 times to create 10 new candidate solutions. In order to
create a new candidate solution, some of the variables are given a new value.
For instance, $\sigma' : \{?v_1 = something, ?v_2 = bar\}$ and $\sigma'' : \{?v_1 = dummy, ?v_2 = bar\}$
could be created by mutation of $\sigma : \{?v_1 = foo, ?v_2 = bar\}$.
The choice of keeping the value associated to a variable or changing it depends
on the reward that assignment received in the validation phase. The lower that
reward was, the higher are the chances for that variable to be mutated. The
mutation itself consists of the assignment of a new value to a variable, for
instance $?v_1 = foo \to ?v_1 = dummy$. That new value is picked at random
from one of the data-stores.
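A reward-biased mutation of this kind can be sketched as follows in Python. The exact probability law is not given in the text; the sketch assumes that a variable is mutated with probability 1 - reward, and that the pool of replacement values is passed in as a list (values) rather than sampled from live data-stores. The function name mutate is hypothetical.

```python
import random


def mutate(binding, rewards, values, rng):
    """Return a mutated copy of `binding`; the original is left untouched.

    binding: dict mapping each variable to its current value
    rewards: dict mapping each variable to its reward in [0, 1]
    values:  pool of replacement values (assumed stand-in for the data-stores)
    """
    child = dict(binding)
    for var in binding:
        # Assumption: the lower the reward, the likelier the mutation.
        if rng.random() < 1.0 - rewards[var]:
            child[var] = rng.choice(values)
    return child
```

Under this assumption a variable with a perfect reward of 1 is never changed, while a variable with reward 0 is always given a new value.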
3 Implementation and Infrastructure Setup
In the following, we detail the implementation of eRDF and the infrastructure
setup for the BTC. The core of eRDF is implemented in Java 1.6. We built it
upon well-known frameworks and toolkits. Jena ARQ[2] is used to parse SPARQL
queries, the evolutionary loop is based on ECJ[4] and a RESTful query interface
is powered by RESTlet[10]. The source code of eRDF is publicly available under
a GPL license.
eRDF, along with the web application detailed below, runs on a machine
with a 2.8 GHz Dual-Core AMD Opteron(tm) Processor, 512 MB of RAM and
200 GB of storage. eRDF requires roughly 40 MB of RAM to run over the BTC
data set. In addition to the BTC data set, we configured eRDF to run over the
following publicly available SPARQL endpoints:
Source               Endpoint                                            Number of triples
US Census data       http://www.rdfabout.com/sparql/                     1000 million
DBPedia              http://dbpedia.org/sparql/                          274 million
MusicBrainz          http://dbtune.org/musicbrainz/sparql/               36 million
DBLP                 http://dblp.l3s.de/d2r/sparql/                      10 million
WordNet              http://wordnet.rkbexplorer.com/sparql/              2 million
Revyu                http://revyu.com/sparql/                            unknown
CIA World Factbook   http://www4.wiwiss.fu-berlin.de/factbook/sparql/    unknown
While the BTC data set includes some parts of these data sets, this is not known
to eRDF; therefore, eRDF operates over well above 2 billion triples. We now
discuss our setup for hosting the BTC dataset.
For the BTC, we used a high performance server with 8 processors, 32 GB
of RAM and 4.6 TB of storage. Each processor was a 2.4 GHz Quad-Core AMD
Opteron(tm) Processor. For triple storage, we selected the quad-store 4store[6].
We ran 58 instances of 4store on a single server where each instance exposed
roughly 20 million triples. There were two reasons to run this number of
instances: to test the distributed query capabilities of eRDF; and to ameliorate
the performance degradation that 4store experiences with large numbers of
properties, which is one of the characteristics of the BTC dataset.
4 Use Case and Discovery Frontend
To demonstrate the use of eRDF over the BTC dataset, we focused on a discovery
use case, namely, the ability to find things that are similar to or "like" a given
entity. We term such a query a like-search. For example, a user that is new to
a city may want to find people like themselves. Alternatively, a user may want
to find a holiday destination that is similar to the one they travelled to last
year. Furthermore, such queries may be useful for finding unrealized connections
between entities. Take, for instance, a social scientist who wants to characterize
a particular university. Often this is done by grouping universities together on
various dimensions, for example, student population size, number of faculty, etc.
However, by first performing a like-search, the scientist may find dimensions
that she or he may not have considered, for example, size of sports teams. These
are just some examples of like-searches.
Fig. 2. Finding entities similar to Tim Berners-Lee
In order to demonstrate like-searches, we built a web application, Like?,
on top of eRDF. The interface to the application is shown in Figure 2. It
shows the results of a like-search on Tim Berners-Lee. To the left of the screen,
the user can enter full-text queries. These queries are forwarded by Like? to
the semantic search engine Sindice [3]. From the response, the first result is
selected and the RDF document it refers to is retrieved. This document is then
transformed into a SPARQL query, which is issued to eRDF. The SPARQL query
produced in the case of the Tim Berners-Lee query contains 23 graph patterns.
The document selected for the like-search is displayed to the user at the top of
the page, in this case, the DBpedia page describing Tim Berners-Lee. Answers
are displayed in the right half of the screen along with the number of triples,
shown in parentheses, that matched the given query. Users can hover over each
answer to see the property and object of the triple. Finally, Like? will continue
to return possible answers to the query until the user clicks stop or a server-side
timeout is reached.
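The text does not spell out how the retrieved document is turned into a SPARQL query. One plausible reading, sketched below, is that every triple describing the selected entity becomes one graph pattern in which the entity is replaced by a variable. The function name like_query, the variable name ?like and the example.org URIs are assumptions made for illustration; they are not taken from eRDF or Like?.

```python
from typing import Iterable, List, Tuple

# (subject, predicate, object), each already written as a SPARQL term.
Triple = Tuple[str, str, str]


def like_query(triples: Iterable[Triple], entity: str, var: str = "?like") -> str:
    """Build a SELECT query with one graph pattern per triple about `entity`.

    Hypothetical helper: the subject is swapped for a variable so that any
    resource sharing the same property/object pairs can match.
    """
    patterns: List[str] = []
    for s, p, o in triples:
        if s != entity:
            continue  # keep only statements made about the entity itself
        patterns.append(f"  {var} {p} {o} .")
    if not patterns:
        raise ValueError(f"no triples with subject {entity}")
    body = "\n".join(patterns)
    return f"SELECT DISTINCT {var} WHERE {{\n{body}\n}}"


# Made-up document content, for illustration only.
ENTITY = "<http://example.org/TimBL>"
EXAMPLE_TRIPLES: List[Triple] = [
    (ENTITY, "<http://example.org/occupation>", "<http://example.org/ComputerScientist>"),
    (ENTITY, "<http://example.org/nationality>", '"British"'),
    ("<http://example.org/Other>", "<http://example.org/knows>", ENTITY),
]

print(like_query(EXAMPLE_TRIPLES, ENTITY))
```

A conventional SPARQL engine would return only resources that satisfy every pattern of such a query. The like-search relies on eRDF returning approximate answers that satisfy some of the patterns, which is why each answer is shown with its number of matching triples.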
We note that the answers given by eRDF are both expected and novel 2.
eRDF obviously finds Tim Berners-Lee as being like himself. Another, better
example is Wendy Hall who, like Tim Berners-Lee, is a well-known British
computer scientist who also holds a professorship at the University of Southampton
and is a fellow of the Royal Academy of Engineering. Some more novel examples
are Ada Lovelace and Thomas Malthus, both of whom are English scientists, as
well as Danny O'Brien, an English technology journalist who blogs.
2 Isn't Tim Berners-Lee the David Beckham of computer science?
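To make the notion of likeness concrete, the sketch below counts, for each candidate, how many (property, object) pairs it shares with the query entity, which is how we read the match count shown next to each answer. The data is a toy example built from the description above, the property names are hypothetical, and the exhaustive scoring stands in for eRDF's evolutionary search rather than reproducing it.

```python
from typing import Dict, List, Set, Tuple

Description = Set[Tuple[str, str]]  # set of (property, object) pairs

# Toy descriptions; property names and values are illustrative assumptions.
DESCRIPTIONS: Dict[str, Description] = {
    "Tim Berners-Lee": {
        ("nationality", "British"),
        ("occupation", "computer scientist"),
        ("professorAt", "University of Southampton"),
        ("fellowOf", "Royal Academy of Engineering"),
        ("category", "English scientist"),
    },
    "Wendy Hall": {
        ("nationality", "British"),
        ("occupation", "computer scientist"),
        ("professorAt", "University of Southampton"),
        ("fellowOf", "Royal Academy of Engineering"),
    },
    "Ada Lovelace": {("category", "English scientist")},
    "Unrelated entity": {("category", "Holiday destination")},
}


def rank_like(query: str, descriptions: Dict[str, Description]) -> List[Tuple[str, int]]:
    """Rank entities by the number of pairs shared with `query`, best first."""
    target = descriptions[query]
    scored = [(name, len(target & desc)) for name, desc in descriptions.items()]
    # Drop entities that share nothing; break ties alphabetically.
    matches = [(name, score) for name, score in scored if score > 0]
    return sorted(matches, key=lambda item: (-item[1], item[0]))


for name, score in rank_like("Tim Berners-Lee", DESCRIPTIONS):
    print(f"{name} ({score})")
```

In this toy data the query entity ranks first because it matches all of its own pairs, followed by Wendy Hall with four shared pairs and Ada Lovelace with one.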
While Like? is in its early prototype stages, we believe it successfully
demonstrates the use of eRDF for discovery over the Web of Data. Like? is accessible
at http://ai01.cs.vu.nl/erdfwww/likeengine/.
5 Conclusion
As the Web of Data continues to grow, it is becoming increasingly necessary to
take a distributed approach to using it: acquiring data from sources on-demand
in a robust fashion. Additionally, because of its size, new techniques need to be
developed to discover the information that is there. The eRDF infrastructure
provides a novel evolutionary technique to enable discovery over distributed
SPARQL endpoints in a robust fashion. In this paper, we described eRDF and
its usage for Like?, an application for finding similar things. This application
runs over live data sources as well as the BTC data set.
References
1. Ontotext AD. Owl semantic repository. http://www.ontotext.com/owlim/.
2. Hewlett-Packard Development Company. ARQ - a SPARQL processor for Jena.
http://jena.sourceforge.net/ARQ/.
3. Digital Enterprise Research Institute (DERI). Sindice - the semantic web index.
http://sindice.com/.
4. George Mason University's ECLab. ECJ - a Java-based evolutionary computation
research system. http://cs.gmu.edu/~eclab/projects/ecj/.
5. A. E. Eiben and J. E. Smith. Introduction to Evolutionary Computing (Natural
Computing Series). Springer, October 2008.
6. Garlik. 4store database. http://4store.org/.
7. Christophe Gueret, Eyal Oren, Stefan Schlobach, and Martijn Schut. An
evolutionary perspective on approximate RDF query answering. In SUM, volume 5291 of
Lecture Notes in Computer Science, pages 215-228. Springer, 2008.
8. Eyal Oren, Christophe Gueret, and Stefan Schlobach. Anytime query answering in
RDF through evolutionary algorithms. In International Semantic Web Conference,
volume 5318 of Lecture Notes in Computer Science, pages 98-113. Springer, 2008.
9. OpenLink Software. Virtuoso DBMS. http://virtuoso.openlinksw.com/.
10. Noelios Technologies. Restlet - lightweight REST framework.
http://www.restlet.org/.