Using visual cues for extraction of tabular data from arbitrary HTML documents
Analysis (2005)
- ISBN: 1595930515
- DOI: 10.1145/1062745.1062838
Available from portal.acm.org
Abstract
We describe a method to extract tabular data from web pages. Rather than just analyzing the DOM tree, we also exploit visual cues in the rendered version of the document to extract data from tables which are not explicitly marked with an HTML table element. To detect tables, we rely on a variant of the well-known X-Y cut algorithm as used in the OCR community. We implemented the system by directly accessing Mozilla's box model that contains the positional data for all HTML elements of a given web page.
Using Visual Cues for Extraction of Tabular Data from
Arbitrary HTML Documents
Bernhard Krüpl
Vienna University of Technology
Institute of Information Systems
Database and Artificial Intelligence Group
kruepl@dbai.tuwien.ac.at
Marcus Herzog
Vienna University of Technology
Institute of Information Systems
Database and Artificial Intelligence Group
herzog@dbai.tuwien.ac.at
Wolfgang Gatterbauer
Vienna University of Technology
Institute of Information Systems
Database and Artificial Intelligence Group
gatter@dbai.tuwien.ac.at
1. INTRODUCTION
Most of today’s web documents are designed for a human
audience. Although many efforts have been undertaken to
bring explicit semantics to the Web, the vast majority of
pages are designed with a certain visual appearance in mind:
authors use HTML as a page layout language rather than
for semantic markup. However, it is a common
misunderstanding that such pages are semantically
poor. Instead, the semantics are merely shifted from an explicit
level (proper HTML or XML tags) to an implicit one: the
spatial alignment of the document text on the page.
Documents have come a long way from the mere sequential
order of sentences to sophisticated layouts following
different typesetting conventions and fashions. Therefore it
seems quite natural to exploit this additional information
for information extraction applications.
Web layouts can be achieved with different methods, ranging
from basic HTML markup to fancy CSS stylesheets
and dynamic client-side programming. Still, most web
information extraction programs operate just on the DOM
tree, where the spatial information cannot be directly
accessed. By utilizing the screen rendering provided by the
open source browser Mozilla, we are able to exploit this
spatial information during the extraction process.
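To make the idea concrete, the following is a minimal sketch of a plain
recursive X-Y cut over element bounding boxes. It is not the variant used
in this work, and it does not touch Mozilla's box model: the Box type, the
min_gap threshold, and the function names are assumptions made for
illustration, and the coordinates are assumed to have been obtained from a
rendering engine beforehand.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Box:
    """Bounding box of one rendered text element (hypothetical type)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


Tree = Union[str, List["Tree"]]


def _split(boxes: List[Box], axis: str, min_gap: float) -> List[List[Box]]:
    """Group boxes separated by empty strips of at least min_gap along one axis."""
    lo, hi = ("x0", "x1") if axis == "x" else ("y0", "y1")
    ordered = sorted(boxes, key=lambda b: getattr(b, lo))
    groups = [[ordered[0]]]
    reach = getattr(ordered[0], hi)  # furthest extent covered so far
    for box in ordered[1:]:
        if getattr(box, lo) - reach >= min_gap:
            groups.append([box])     # empty strip found: start a new group
        else:
            groups[-1].append(box)   # overlaps or nearly touches: same group
        reach = max(reach, getattr(box, hi))
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> Tree:
    """Recursively cut a region, trying horizontal cuts (rows) before vertical ones.

    Regions that cannot be cut any further become leaves holding their text.
    """
    if len(boxes) > 1:
        for axis in ("y", "x"):
            groups = _split(boxes, axis, min_gap)
            if len(groups) > 1:
                return [xy_cut(group, min_gap) for group in groups]
    return " ".join(box.text for box in boxes)


if __name__ == "__main__":
    # Four positioned elements that form a 2x2 table without any table markup.
    page = [
        Box("Name", 0, 0, 40, 10), Box("Price", 60, 0, 100, 10),
        Box("Tea", 0, 20, 40, 30), Box("3.50", 60, 20, 100, 30),
    ]
    print(xy_cut(page))  # [['Name', 'Price'], ['Tea', '3.50']]
```

The sketch works only on coordinates, so it yields the same rows and
columns whether the elements come from a table element or from
independently positioned blocks. The choice of min_gap is an assumption
that would need tuning against real page layouts.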
2. WEB PUBLICATION PROCESS
The publication of a web page can be understood as a
communication process from person to person (see Figure
1).
Copyright is held by the author/owner(s).
WWW2005, May 10–14, 2005, Chiba, Japan.
Figure 1: Layers of abstraction in the web publication
process
Starting at the left-hand side, an author edits a web page,
often using a visual editor. The result of this step is
HTML source code, possibly including some CSS.
This initial representation of the communication content is
then gradually transformed for transmission (going down the
stack of communication layers). At the receiver’s side, the
transformations are applied in reverse, moving
the information up the stack. In the last step, a web browser
creates a visual rendering from the supplied HTML code by
applying a rendering algorithm.
The format at the transport layer is clearly
not useful for information extraction. The further
information moves down the transformation stack, the more
noise and redundancy are added. Conversely, information
is found in its purest form as close to
the receiver as possible: in this case in the visual rendering,
not in its encoding in various formats.
See [4] for an inspiring discussion of how information is
transmitted between two persons.


