A Survey of Web Information Extraction Systems

by C H Chang, M Kayed, R Girgis, K F Shaalan
IEEE Transactions on Knowledge and Data Engineering

Abstract

The Internet presents a huge amount of useful information that is usually formatted for human users, which makes it difficult to extract relevant data from various sources. Therefore, robust, flexible information extraction (IE) systems that transform Web pages into program-friendly structures such as a relational database will become a great necessity. Although many approaches for data extraction from Web pages have been developed, there has been limited effort to compare such tools. Unfortunately, the results generated by distinct tools can be directly compared in only a few cases, since the addressed extraction tasks differ. This paper surveys the major Web data extraction approaches and compares them in three dimensions: the task domain, the techniques used, and the automation degree. The criteria of the first dimension explain why an IE system fails to handle some Web sites of particular structures. The criteria of the second dimension classify IE systems based on the techniques used. The criteria of the third dimension measure the degree of automation for IE systems. We believe these criteria provide qualitative measures to evaluate various IE approaches.

Available from ieeexplore.ieee.org

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, TKDE-0475-1104.R3

Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled Shaalan

Index Terms: Information Extraction, Web Mining, Wrapper, Wrapper Induction.

1 INTRODUCTION

The explosive growth and popularity of the world-wide web has resulted in a huge number of information sources on the Internet. However, due to the heterogeneity and the lack of structure of Web information sources, access to this huge collection of information has been limited to browsing and searching.
Sophisticated Web mining applications, such as comparison shopping robots, require expensive maintenance to deal with different data formats. To automate the translation of input pages into structured data, much effort has been devoted to the area of information extraction (IE). Unlike information retrieval (IR), which concerns how to identify relevant documents in a document collection, IE produces structured data ready for post-processing, which is crucial to many applications of Web mining and searching tools.

Formally, an IE task is defined by its input and its extraction target. The input can be unstructured documents such as free text written in natural language (e.g., Figure 1) or the semi-structured documents that are pervasive on the Web, such as tables or itemized and enumerated lists (e.g., Figure 2). The extraction target of an IE task can be a relation of k-tuples (where k is the number of attributes in a record) or a complex object with hierarchically organized data. For some IE tasks, an attribute may have zero (missing) or multiple instantiations in a record. An IE task becomes more difficult when various permutations of attributes or typographical errors occur in the input documents.

Programs that perform the task of IE are referred to as extractors or wrappers. A wrapper was originally defined as a component of an information integration system that aims to provide a single uniform query interface to multiple information sources. In an information integration system, a wrapper is generally a program that "wraps" an information source (e.g., a database server or a Web server) so that the information integration system can access that information source without changing its core query answering mechanism.
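The extraction target defined above can be pictured with a small data model. The following is only an illustrative sketch: the BookRecord type, its three attributes (so k = 3), and the sample values are assumptions made for this example and do not come from the paper. It shows one attribute that may be missing and one that may have multiple instantiations, and how a record flattens into a k-tuple.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class BookRecord:
    """Hypothetical extraction target with k = 3 attributes."""
    title: str
    # Zero instantiations: the attribute may be missing on the page.
    price: Optional[str] = None
    # Multiple instantiations: a record may list several authors.
    authors: List[str] = field(default_factory=list)

    def as_tuple(self) -> Tuple:
        """Flatten the record into a k-tuple for a relational table."""
        return (self.title, self.price, tuple(self.authors))


# Made-up sample records; the second one has no price and no authors.
records = [
    BookRecord("Databases", "$30", ["Smith", "Lee"]),
    BookRecord("Web Mining"),
]
rows = [r.as_tuple() for r in records]
```

A hierarchically organized target would nest such records inside one another instead of flattening them into tuples.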
In the case where the information source is a Web server, a wrapper must query the Web server to collect the resulting pages via HTTP, perform information extraction to extract the contents of the HTML documents, and finally integrate the results with other data sources. Among the three procedures, information extraction has received the most attention, and some authors use the term wrapper to denote extractor programs. Therefore, we use the terms extractors and wrappers interchangeably.

Wrapper induction (WI) or information extraction (IE) systems are software tools that are designed to generate wrappers. A wrapper usually performs a pattern matching procedure (e.g., a form of finite-state machine) that relies on a set of extraction rules. Tailoring a WI system to a new requirement is a task that varies in scale depending on the text type, domain, and scenario. To maximize reusability and minimize maintenance cost, designing a trainable WI system has been an important topic in the research fields of message understanding, machine learning, data mining, etc.

The task of Web IE, with which we are concerned in this paper, differs largely from traditional IE tasks in that traditional IE aims at extracting data from totally unstructured free texts written in natural language. Web IE, in contrast, processes online documents that are semi-structured and usually generated automatically by a server-side application program. As a result, traditional IE usually takes advantage of NLP techniques such as lexicons and grammars, whereas Web IE usually applies machine learning and pat-

  • Chia-Hui Chang is with the Department of Computer Science and Information Engineering, National Central University, No. 300, Jungda Rd., Jhongli City, Taoyuan, Taiwan 320, R.O.C. E-mail: chia@csie.ncu.edu.tw.
  • Mohammed Kayed is with the Mathematics Department, Beni-Suef University, Egypt. E-mail: mskayed@yahoo.com.
  • Moheb Ramzy Girgis is with the Department of Computer Science, Minia University, El-Minia, Egypt. E-mail: mrgirgis@mailer.eun.eg.
  • Khaled Shaalan is with The British University in Dubai (BUiD), United Arab Emirates. E-mail: khaled.shaalan@buid.ac.ae.
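The introduction describes a wrapper as a pattern matching procedure that relies on a set of extraction rules. A minimal sketch of that idea follows, assuming a hand-written regular expression as the single extraction rule and an inlined HTML snippet in place of a page fetched over HTTP. The rule, the page layout, and the attribute names are all hypothetical, and the sketch is not the method of any system covered by the survey. A wrapper induction system would aim to generate such a rule rather than have a person write it.

```python
import re

# Hypothetical extraction rule: the HTML delimiters around each attribute.
# The price group is optional, so a record with a missing price still matches.
RULE = re.compile(
    r"<li>\s*<b>(?P<title>[^<]+)</b>\s*(?:<i>(?P<price>[^<]+)</i>)?\s*</li>"
)


def extract(html: str) -> list:
    """Apply the rule to a page and return one dict per matched record."""
    return [m.groupdict() for m in RULE.finditer(html)]


# Made-up semi-structured page, standing in for a server-generated list.
PAGE = """
<ul>
  <li><b>Databases</b> <i>$30</i></li>
  <li><b>Web Mining</b></li>
</ul>
"""

records = extract(PAGE)
```

Because the rule is tied to the exact tags of this page, a change in the page template would break it, which is the maintenance cost that trainable WI systems try to reduce.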
