
Building a Knowledge Base with Data Crawled from Semantic Web

by Ivo Lašek, Peter Vojtáš
Knowledge Creation Diffusion Utilization (2011)

Abstract

In this paper, we compare various approaches to semantic web data crawling. We introduce our crawling framework, which enables us to organize and clean the data before they are presented to the end user or used as a knowledge base. We present methods of semantic data cleaning in order to keep the knowledge base consistent. We used the proposed framework to build a knowledge base containing data about persons crawled from semantic web data sources. In this paper we present the results of the crawling process.

Available from research.i-lasek.cz

J. Zendulka, M. Rychlý (eds.), DATAKON 2011, Mikulov, 15-18.10.2011, pp. 1-10.
Building a Knowledge Base with Data Crawled from
Semantic Web
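Ivo LAŠEK¹, Peter VOJTÁŠ²
¹ Department of Software Engineering, Faculty of Information Technology, Czech Technical
University in Prague (FIT ČVUT v Praze)
Kolejní 2, 160 00 Praha 6
lasekivo@fit.cvut.cz
² Department of Software Engineering, Faculty of Mathematics and Physics, Charles
University in Prague (MFF UK Praha)
Malostranské nám. 25, 118 00 Praha
vojtas@ksi.ms.mff.cuni.cz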

Abstract. In this paper, we compare various approaches to semantic web data
crawling. We introduce our crawling framework, which enables us to organize and
clean the data before they are presented to the end user or used as a knowledge base.
We present methods of semantic data cleaning in order to keep the knowledge base
consistent. We used the proposed framework to build a knowledge base containing
data about persons crawled from semantic web data sources. In this paper we present
the results of the crawling process.
Keywords: semantic web, search, crawling
1 Introduction
Conventional search engines still present their results in a largely document centric
way. The two most popular search engines (Google and Yahoo!) can already parse some
semantic information, but we cannot perform any structured queries over the
collected data. Google calls this functionality Rich Snippets [1]; the data
extracted from semantic documents is displayed as a grey information snippet under each
search result (provided that the target document contains some structured data). Yahoo!
offers very similar functionality under the name SearchMonkey [2]. Both extract the
structured data from a variety of sources using standard markup and vocabularies (e.g.
Microdata, Microformats, RDFa, eRDF). Google as well as Yahoo! state that they use these
data to enhance their search results. Yahoo! goes even further and lets developers
build their own applications exploiting such structured data.
Many well-managed public structured data sources have emerged, such as DBpedia¹ or
Freebase². Popular web pages present part of their data in a structured form. For example,
LinkedIn³ uses Microformats to publish contacts of registered users and basic information
about their employment history. There are initiatives demanding that government data be
published as Linked Data. In the Czech Republic, this interest is represented by the
OpenData.cz⁴ initiative.

1 http://dbpedia.org
2 http://www.freebase.com
3 http://www.linkedin.com
4 http://opendata.cz
These and many other examples led us to the idea of forming a knowledge base composed
of the data collected from such structured data sources. The amount of data already
available in a structured form on the web makes it possible to form huge knowledge
bases with fresh data published every day.
However, conventional search engines do not combine information from multiple
sources to form a completely new data source or to discover new facts. Data with a
semantic meaning is used rather as a better form of keywords describing the content of an
unstructured document.
Over the last couple of years, a new kind of search engine has emerged. In this article, we
call it the semantic search engine. Semantic search engines address the problems of
crawling and integrating the data obtained from the semantic web. So far, most of the
work has focused on crawling as much data as possible and providing a simple
user interface to search them. In most cases, the interface consists of an ordinary
search field.
Little or no emphasis has been put on the quality of the crawled data. The search results of
semantic search engines are full of duplicates, which makes it difficult to locate relevant
data sources. The documents presented in search results often contain almost no relevant
information apart from their name, represented by the label.
In this article we introduce our framework for crawling semantic data from the web and
address the problem of data quality. We report the results of a crawling task performed by
our framework, present basic statistics of the crawled data, and evaluate
its quality. We thereby provide insight into the nature of the data that can be obtained
from semantic data sources (Linked Data). The results of our work can serve as a basis for
designing specific knowledge bases composed of Linked Data.
2 Related work
As stated in [9], semantic search engines can be classified into two categories: systems
that operate on a document abstraction level (we call them document centric) and systems
that operate on an object oriented model (we call them entity centric).
2.1 Document Centric Approach
The representatives of this group are Swoogle [7] and Sindice [14]. While Swoogle does
not seem to crawl continuously, Sindice is actively developed and
available as a public service.
The information in a document centric semantic search engine is organized by
documents, i.e. the data sources from which the information comes. A document can be a
single web page, an RDF document, an RSS feed, etc. Document centric search engines
return documents as search results, regardless of how many entities a document
describes. Also, when one entity is described in multiple documents, the search engine
returns them separately, and the user has to merge the information manually. From this
point of view, the entity centric approach seems more promising.
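The document centric behaviour described above can be illustrated by a minimal Python
sketch. The document URLs, the triples, and the function name search_documents are
hypothetical and chosen only for illustration; they do not reflect how Swoogle or Sindice
are implemented.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical crawled documents; URLs and triples are invented for illustration.
DOCUMENTS: Dict[str, List[Triple]] = {
    "http://example.org/doc1.rdf": [
        ("ex:alice", "foaf:name", "Alice Example"),
        ("ex:bob", "foaf:name", "Bob Example"),
    ],
    "http://example.org/doc2.rdf": [
        ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
    ],
}


def search_documents(entity: str) -> List[str]:
    """Return the URL of every document that describes the given entity."""
    return sorted(
        url
        for url, triples in DOCUMENTS.items()
        if any(subject == entity for subject, _, _ in triples)
    )


# The entity is spread over two documents, so two separate hits come back
# and the caller still has to merge their contents manually.
hits = search_documents("ex:alice")
print(hits)
```

The result is a list of data sources rather than a description of the entity itself,
which is the limitation the entity centric approach tries to remove.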
2.2 Entity Centric Approach
An entity centric semantic search engine works on a higher abstraction level. The
information from several documents describing the same thing is aggregated into one entity.
Such an entity is thus better described, because the search engine collects data from
multiple semantic documents.
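The aggregation step can be sketched in a few lines of Python. This is only an
illustration under simplifying assumptions: the documents, triples, and the function name
aggregate_entities are hypothetical, and the sketch assumes that all documents refer to
an entity by the same identifier, which real crawled data often does not guarantee.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical crawled documents; URLs and triples are invented for illustration.
DOCUMENTS: Dict[str, List[Triple]] = {
    "http://example.org/doc1.rdf": [
        ("ex:alice", "foaf:name", "Alice Example"),
    ],
    "http://example.org/doc2.rdf": [
        ("ex:alice", "foaf:name", "Alice Example"),  # duplicate statement
        ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
    ],
}


def aggregate_entities(
    documents: Dict[str, List[Triple]],
) -> Dict[str, Dict[str, Set[str]]]:
    """Merge the triples of all documents into one description per entity."""
    entities: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for triples in documents.values():
        for subject, predicate, obj in triples:
            # Using a set collapses statements repeated across documents.
            entities[subject][predicate].add(obj)
    return {subject: dict(props) for subject, props in entities.items()}


entities = aggregate_entities(DOCUMENTS)
print(entities["ex:alice"])
```

Storing property values in sets means that a statement repeated in several documents
appears only once in the merged entity. Deciding that two different identifiers denote
the same thing is a separate problem that this sketch does not attempt.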
