Semantic Data Extraction from Infobox Wikipedia Template

by Amira AbdEl-atey, Sherif El-etriby, Arabi S Kishk
International Journal of Computer Applications (2012)

Abstract

Wikis are an established means for the collaborative authoring, versioning and publishing of textual articles. Wikipedia, for example, succeeded in creating by far the largest encyclopedia on the basis of a wiki alone. Wikis are created by wiki software and are often used to create collaborative works. One of the key challenges of computer science is answering rich queries. Several approaches have been proposed for extending wikis to allow the creation of structured and semantically enriched content. The Semantic Web allows the creation of such content, and Semantic Web content helps us to answer rich queries. One of the new applications of the Semantic Web is DBpedia. The DBpedia project focuses on creating semantically enriched, structured information from Wikipedia. In this article, we describe and clarify the DBpedia project. We test the project by extracting structured data as triples from some Wikipedia resources, and we present examples for a car resource and the Berlin resource. The output data is in RDF (Resource Description Framework) triple format, which is the basic technology used for building the Semantic Web. We can answer rich queries by making use of the Semantic Web structure.

Available from research.ijcaonline.org

International Journal of Computer Applications (0975 – 8887)
Volume 40 – No.17, February 2012

Amira Abd El-atey, Sherif El-etriby, Arabi S. Kishk
Faculty of Computers and Information, Menofia University

General Terms
Information retrieval, semantic web.

Keywords
Wikipedia, semantic web, DBpedia, data extraction framework, structured knowledge, Wikipedia templates, MediaWiki software, infobox template.

1. INTRODUCTION
The free encyclopedia Wikipedia has been tremendously successful due to the ease of collaboration of its users over the Internet.
The Wikipedia wiki is representative of a new way of publishing and currently contains millions of articles. Wikipedia is a free, web-based, collaborative, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. Its 18 million articles (over 3.6 million in English) have been written collaboratively by volunteers around the world, and almost all of its articles can be edited by anyone with access to the site. Wikipedia was launched in 2001 by Jimmy Wales and Larry Sanger and has become the largest and most popular general reference work on the Internet, with 365 million readers. It is a natural idea to exploit this source of knowledge. However, Wikipedia's search capabilities are limited to full-text search, which allows only very limited access to this valuable knowledge base [13].

The DBpedia project focuses on converting Wikipedia content into structured knowledge, so that Semantic Web techniques can be employed against it: asking sophisticated queries against Wikipedia and linking it to other datasets on the Web. The project was started by people at the Free University of Berlin and the University of Leipzig, in collaboration with OpenLink Software, and the first publicly available dataset was published in 2007. It is made available under free licences, allowing others to reuse the dataset [14].

Until March 2010, the DBpedia project used a PHP-based extraction framework, written in PHP 5, to extract different kinds of structured information from Wikipedia. This framework has been superseded by a new Scala-based extraction framework, and the old PHP framework is no longer maintained. The new framework, written in Scala 2.8, is available from the DBpedia Mercurial repository (GNU GPL license).
Wikipedia articles consist mostly of free text, but also include structured information embedded in the articles, such as "infobox" tables, categorization information, images, geo-coordinates and links to external Web pages. This structured information is extracted and put in a form which can be queried [4,8]. The Semantic Web is able to describe things in a way that computers can understand. Answering semantically rich queries is one of the key challenges of the Semantic Web today [7].

In this article, we give an overview of Wikipedia and the Semantic Web in Sections 2 and 3 respectively. Section 4 discusses the structured data extraction framework of Wikipedia. Section 5 discusses the integration of Semantic Web data on the web. Section 6 gives an overview of related work. Section 7 concludes and outlines future work.

2. WIKIPEDIA
Wikipedia articles consist mostly of free text, but also contain different types of structured information, such as infobox templates, categorization information, images, geo-coordinates and links to external Web pages. This structured information can be extracted from Wikipedia and can serve as a basis for enabling sophisticated queries against Wikipedia content. The DBpedia project extracts this structured information from Wikipedia and turns it into a rich knowledge base, which can later be used to ask sophisticated queries. In this section we give an overview of MediaWiki templates and the infobox template.

2.1 MediaWiki Templates
Wikipedia templates are provided by the MediaWiki software. MediaWiki is open source software on which wikiHow, Wikipedia, Wiktionary, and many other wiki sites are based. The wiki engine enables each member to search, read, add and edit articles, and thus improve the content of the wiki. Wiki software can be downloaded as a ready-made tool, and in the majority of cases its use is free of charge [10].
MediaWiki supports a sophisticated template mechanism to include predefined content or to display content in a determined way. Examples of MediaWiki templates are the input box, the message box and the infobox. The infobox template is intended as a meta-template.

2.2 Infobox template
A special type of template is the infobox, which aims at generating consistently formatted boxes for certain content in articles describing instances of a specific type. An infobox template is a fixed-format table designed to be added to the top right-hand corner of articles to consistently present a summary of some unifying aspect. An example of infobox template code, for Al Menoufiya, is shown in Figure 1. As we see, the infobox is enclosed in {{ }} operators. The summary of Al Menoufiya is described as label/data rows giving its name, image, country, area, population and other data about the city. Infobox templates are used on pages describing similar content [1, 3]. The generated view of the infobox code is shown in Figure 2; it visualizes the code of Figure 1. Other examples of infoboxes include geographic entities, education, plants, organizations, people and so on.

3. SEMANTIC WEB
The Semantic Web is a new vision of the current web. It is a web that is able to describe things in a way that computers can understand; it is not about links between web pages. The Semantic Web describes the relationships between things (like A is a part of B and Y is a member of Z) and the properties of things (like size, weight, age, and price). It comprises both standardized and non-standardized technologies. In this section we describe two important Semantic Web technologies, RDF and SPARQL.

3.1 The Resource Description Framework technology
The RDF (Resource Description Framework) is a language for describing information and resources on the web.
Putting information into RDF files makes it possible for computer programs ("web spiders") to search, discover, pick up, collect, analyze and process information from the web. The Semantic Web uses RDF to describe web resources. RDF statements are usually expressed as subject-predicate-object triples. If you use RDF to represent data, you need a way of accessing information that mirrors the flexibility of the RDF information model; RDF query languages such as SPARQL provide this. The RDF model for the web can be considered the equivalent of the ER (Entity-Relationship) model for the RDBMS (relational database management system). Let's look at a simple example. Consider the fact that "The book was written by Jane Doe." In a traditional ER model, this information

    {{Infobox
    |name = Al Menoufiya
    |image = Menofia.png
    |country = Al Menoufiya
    |area = 2554
    |population = 1780153
    |population as of = 1996
    |population density = 1088
    |administration area = shbein elkom
    |postal code = 23511 – 23754
    |lat_deg = 30
    |lat_min = 26
    |lat_hem = North
    |lon_deg = 31
    |lon_min = 4
    |lon_hem = East
    |Website = [www.monofeya.gov.eg]
    }}

Figure 1. Infobox template code

    Al Menoufiya
    Country               Egypt
    Area                  2554 km²
    Population            2,780,153 (1996)
    Population density    1088/km²
    Administration area   Shbein elkom
    Postal code           23511-23754
    Coordinates           30° 26′ 19″ North, 31° 04′ 08″ East
    Website               www.menofiya.gov.eg

Figure 2. Visualization of Infobox Template about Al Menoufiya
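To connect the label/data rows of the Figure 1 infobox with the subject-predicate-object model of Section 3.1, the following is a minimal Python sketch. It is not the DBpedia extraction framework (which the text says is written in Scala), and the function names parse_infobox, to_triples and match are hypothetical names chosen for illustration. It assumes a flat infobox with no nested templates or piped links, uses plain strings rather than URIs, and stands in a simple wildcard pattern matcher for a real SPARQL query.

```python
import re

# Abridged copy of the Figure 1 infobox, with values as printed there.
INFOBOX = """{{Infobox
|name = Al Menoufiya
|image = Menofia.png
|area = 2554
|population = 1780153
|population as of = 1996
}}"""


def parse_infobox(wikitext):
    """Return the label/data rows of a flat {{Infobox ...}} block as a dict.

    Assumption: values contain no '|' and no nested '{{ }}' templates.
    """
    found = re.search(r"\{\{\s*Infobox(.*?)\}\}", wikitext, re.DOTALL)
    if found is None:
        return {}
    rows = {}
    # Everything before the first '|' is not a row, so skip it.
    for part in found.group(1).split("|")[1:]:
        label, sep, data = part.partition("=")
        if sep and label.strip():
            rows[label.strip()] = data.strip()
    return rows


def to_triples(subject, rows):
    """Turn each label/data row into a (subject, predicate, object) triple."""
    return [(subject, label.replace(" ", "_"), data)
            for label, data in rows.items()]


def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every given position.

    None acts as a wildcard; this only imitates a single triple pattern
    and is not a SPARQL implementation.
    """
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


rows = parse_infobox(INFOBOX)
triples = to_triples(rows["name"], rows)
for triple in triples:
    print(triple)

# Ask: what is the population of the described resource?
print(match(triples, p="population"))
```

Each row becomes one triple whose subject is the value of the name row, so a question such as "what is the population?" reduces to matching a pattern with the subject and object left open. A real extraction would additionally map labels to ontology properties and give values datatypes.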
