Distance Measures for Layout-Based Document Image Retrieval
- ISSN: 10959203
- ISBN: 0769525318
- DOI: 10.1109/DIAL.2006.16
- PubMed: 17478679
Abstract
Most methods for document image retrieval rely solely on text information to find similar documents. This paper describes a way to use layout information for document image retrieval instead. A new class of distance measures is introduced for documents with Manhattan layouts, based on a two-step procedure: First, the distances between the blocks of two layouts are calculated. Then, the blocks of one layout are assigned to the blocks of the other layout in a matching step. Different block distances and matching methods are compared and evaluated using the publicly available MARG database. On this dataset, the layout type can be determined successfully in 92.6% of the cases using the best distance measure in a nearest neighbor classifier. The experiments show that the best distance measure for this task is the overlapping area combined with the Manhattan distance of the corner points as block distance together with the minimum weight edge cover matching.
Joost van Beusekom, Daniel Keysers, Faisal Shafait, Thomas M. Breuel
Image Understanding and Pattern Recognition (IUPR) Research Group
German Research Center for Artificial Intelligence (DFKI)
and Technical University of Kaiserslautern
D-67663 Kaiserslautern, Germany
{joost, keysers, faisal, tmb}@iupr.net
1 Introduction
Most information that is currently available digitally, especially in libraries, is organized in the form of documents, which are typically stored in databases. Finding relevant information in such databases is a crucial problem of the information society. Many methods for document retrieval exist, but their success depends strongly on the format in which the documents are stored: the Google search engine does a very good job of document retrieval for WWW pages, and the Windows operating system contains a search assistant that performs a full-text search of all MS Word or Excel documents on a PC's hard disk within seconds. But so far, no software system can reliably perform content-based search in image or video files, whether on the web or locally.
A problem of current querying methods is that they require a document to be present in text form, and that they find similar documents by comparing textual content.
For documents in image form, such as those produced by a scanner, this approach has some drawbacks, since the document first has to be converted to text by Optical Character Recognition (OCR) software. This process is computationally expensive, and it can also introduce errors that may prevent a document from ever being found again. A more fundamental problem is that the textual content of the documents being searched for may be unknown, e.g. when searching for all CD covers on a home PC, or the text information may be irrelevant or insufficient to answer a query, e.g. when a user wants to search for all IEEE-style publications in an archive.
In this paper, we present a method to query document image databases by layout, in particular by measuring how similar each layout is to that of a reference or query document. The method works directly on the image data and does not require a costly OCR step. Depending on the application, it can either be the only search criterion used or act as an additional search feature for the user.
Distance measures for the similarity of two layouts can be used in numerous ways. In this paper, we concentrate on their use for layout-based document retrieval. Other possible uses include the benchmarking of different layout analysis algorithms, or tie-breaking in layout analysis systems that are based on the combination of different layout analysis techniques. We also restrict ourselves to geometric layout information, i.e. how a page is split into different homogeneous regions like columns and paragraphs. We consider only Manhattan layouts because they represent the most general class of layouts found in practice, and they can be easily represented by a set of rectangular blocks. We do not study the logical partitioning of the page into semantic blocks like title, abstract, and author.
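To make the representation concrete, the following is a minimal sketch, under stated assumptions, of a Manhattan layout as a set of rectangular blocks and of the two-step distance outlined in the abstract. The names `Block`, `corner_manhattan`, and `layout_distance` are hypothetical and not taken from the paper. The block distance uses only the Manhattan distance of the corner points; the overlapping-area term is omitted because this section does not say how the two are combined. The matching step is a simple greedy edge cover (every block is linked to its nearest block in the other layout), which is a stand-in for, and not equivalent to, the minimum weight edge cover matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Axis-aligned rectangle: (x0, y0) is one corner, (x1, y1) the opposite one."""
    x0: float
    y0: float
    x1: float
    y1: float


def corner_manhattan(a: Block, b: Block) -> float:
    """Sum of Manhattan distances between the corresponding corner points."""
    return (abs(a.x0 - b.x0) + abs(a.y0 - b.y0)
            + abs(a.x1 - b.x1) + abs(a.y1 - b.y1))


def layout_distance(p, q, block_dist=corner_manhattan):
    """Two-step layout distance (illustrative assumption, not the paper's exact method).

    Step 1: compute all pairwise block distances.
    Step 2: build an edge cover greedily by linking every block to its
            nearest block in the other layout, then sum the selected edges.
    """
    if not p or not q:
        raise ValueError("both layouts need at least one block")
    cost = [[block_dist(a, b) for b in q] for a in p]
    edges = set()
    for i in range(len(p)):  # cover every block of p
        nearest = min(range(len(q)), key=lambda j: cost[i][j])
        edges.add((i, nearest))
    for j in range(len(q)):  # cover every block of q
        nearest = min(range(len(p)), key=lambda i: cost[i][j])
        edges.add((nearest, j))
    return sum(cost[i][j] for i, j in edges)


# Hypothetical page layouts on a 100 x 100 page.
two_col = [Block(0, 0, 45, 100), Block(55, 0, 100, 100)]
shifted = [Block(2, 0, 47, 100), Block(57, 0, 102, 100)]
one_col = [Block(0, 0, 100, 100)]

# Rank candidate layouts by their distance to the query layout.
database = {"shifted": shifted, "one_col": one_col}
ranking = sorted(database, key=lambda name: layout_distance(two_col, database[name]))
print(ranking)
```

Because an edge cover lets one block be linked to several blocks of the other layout, the two columns of `two_col` can both be assigned to the single block of `one_col`, which a one-to-one assignment could not express.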