Extending Page Segmentation Algorithms for Mixed-Layout Document Processing
- ISBN: 9781457713507
- DOI: 10.1109/ICDAR.2011.251
Abstract
With the advent of more powerful personal computers, inexpensive memory and digital cameras, curators around the world are working towards preserving historical documents on computers. Since many of the organizations for which they work have limited funds, there is worldwide interest in a low-cost solution for obtaining these digital records in computer-readable form. An open source layout analysis system called OCRopus is being developed for such a purpose. In its original state, though, it could not process documents that contained information other than text. Segmenting the page into regions of text and non-text areas is the first step of analyzing a mixed content document, but this capability did not exist in OCRopus. Therefore, the goal of this thesis was to add this capability so that OCRopus could process a full spectrum of documents. By default, the RAST page segmentation algorithm processed text-only documents at a target resolution of 300 DPI. In a separate module, the Voronoi algorithm divided the page into regions, but did not classify them. Additionally, it tended to oversegment non-text regions and was tuned to a resolution of 300 DPI. Therefore, the RAST algorithm was improved to recognize non-text regions and the Voronoi algorithm was extended to classify text and non-text regions and merge non-text regions appropriately. Finally, both algorithms were modified to perform at a range of resolutions. Testing on a set of documents consisting of different types showed an improvement of 15-40% for the RAST algorithm, giving it an average segmentation accuracy of about 80%. Partially due to the representation of the ground truth, the Voronoi algorithm did not perform as well as the improved RAST algorithm, averaging around 70% overall. Depending on the layout of the historical documents to be digitized, though, either algorithm could be sufficiently accurate to be utilized.
Extending Page Segmentation Algorithms for Mixed-Layout Document Processing
Amy Winder
Computer Science Department
Boise State University
Boise, ID 83725, USA.
Tim Andersen
Computer Science Department
Boise State University
Boise, ID 83725, USA.
Elisa H. Barney Smith
Electrical & Computer Engineering Department
Boise State University
Boise, ID 83725, USA.
Abstract—The goal of this work is to add the capability
to segment documents containing text, graphics, and pictures
in the open source OCR engine OCRopus. To achieve this
goal, OCRopus’ RAST algorithm was improved to recognize
non-text regions so that mixed content documents could be
analyzed in addition to text-only documents. Also, a method
for classifying text and non-text regions was developed and
implemented for the Voronoi algorithm, enabling users to
perform OCR on documents processed by this method. Finally,
both algorithms were modified to perform at a range of
resolutions.
Our testing showed an improvement of 15-40% for the
RAST algorithm, giving it an average segmentation accuracy
of about 80%. The Voronoi algorithm averaged around 70%
accuracy on our test data. Depending on the particular layout
and idiosyncrasies of the documents to be digitized, however,
either algorithm could be sufficiently accurate to be utilized.
Keywords- page segmentation, RAST, Voronoi, open source
OCR
I. INTRODUCTION AND BACKGROUND
Numerous historical documents in book and other forms
have yet to be digitized. Historical books can be too fragile
to be scanned, but today’s inexpensive digital cameras can
produce images comparable in quality and resolution to
those generated by a flatbed scanner. Thus, it is now feasible
to safely and cheaply digitize historical documents.
Once documents are digitized, it is desirable to convert the document
images into text documents for readability and searchability. A first
step is to analyze the image to determine which areas are
text and which are not, so that only text regions are sent to
the OCR engine. This process is called page segmentation.
Page segmentation algorithms can be categorized as top-
down, bottom-up or hybrid methods [1]. Top-down methods
involve operating on the document as a whole and subdivid-
ing it, whereas bottom-up methods start at the pixel-level and
recursively merge constructs into segmented regions. Hybrid
methods combine elements of both.
-----
Amy Winder is now with Hewlett Packard, Boise.
Contact: EBarneySmith@BoiseState.edu
-----
The Recursive X-Y Cut (RXYC) and Run-Length Smearing
Algorithms (RLSA) are top-down methods. RXYC [2]
uses vertical and horizontal projections of the binarized
image where the white areas correspond to low elevations
and the black areas to high elevations. Valleys in
the projections then delineate candidate segmentations. The
algorithm recursively subdivides the document around the
largest valley(s), maintaining the data in a structure called
an X-Y tree. RLSA [3] operates like RXYC, but classifies the
regions as well. It examines each of the pixels in a row-by-row
and column-by-column fashion and changes each white
pixel to black if it is surrounded by enough black pixels, after
which the generated row and column bit maps are ANDed
together to form a single bit map. This is then smoothed
horizontally to connect words in text lines. Block features
(such as numbers of black and white pixels, etc.) determine
block classification.
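The smearing and AND steps of RLSA can be sketched as follows. This is a minimal illustration, not the implementation from [3] or from OCRopus. It assumes the usual run-length formulation, in which a run of white pixels is filled when it is no longer than a threshold and is bounded by black pixels on both sides. The function names, the thresholds, and the list-of-lists image format (1 = black, 0 = white) are hypothetical choices made for this sketch. The final horizontal smoothing pass and the block classification are omitted.

```python
def smear_runs(line, threshold):
    """Return a copy of a 0/1 sequence in which every white run that is at
    most `threshold` pixels long and lies between two black pixels is filled."""
    out = list(line)
    last_black = None
    for i, pixel in enumerate(line):
        if pixel:
            gap = i - last_black - 1 if last_black is not None else 0
            if 0 < gap <= threshold:
                for k in range(last_black + 1, i):
                    out[k] = 1
            last_black = i
    return out


def rlsa(image, h_threshold, v_threshold):
    """Smear rows and columns independently, then AND the two bit maps.

    `image` is a non-empty list of equal-length rows of 0/1 values.
    """
    rows = [smear_runs(row, h_threshold) for row in image]
    # zip(*image) iterates over the columns of the image.
    cols = [smear_runs(col, v_threshold) for col in zip(*image)]
    height, width = len(image), len(image[0])
    return [[rows[y][x] & cols[x][y] for x in range(width)]
            for y in range(height)]
```

Because of the AND, a white pixel turns black only when it is bridged both horizontally and vertically, which keeps a long horizontal smear from merging across a wide vertical gap.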
OCRopus’ version of RAST [4], [5] was designed for
text-only documents and consists of three steps: finding
the columns, finding the text-lines, then determining the
reading order. It finds columns using a whitespace rectangle
algorithm [6] similar to RXYC. The largest whitespace
rectangles (covers) delimited by the connected components
of the image are determined and sorted by how many
connected components touch each major side. Covers are
then merged iteratively as long as the combined cover obeys
a given rule on how many components are incident upon it.
Reading order is determined by considering pairs of lines
and deciding which line of each pair comes first: a line is
followed either by the line below it or by the line at the top
of the page to its right (e.g. in the next column). These
pairwise orderings are then sorted to give the final reading order.
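The pairwise ordering and the final sort can be sketched as a topological sort over text-line bounding boxes. The two ordering rules below are assumptions made for illustration and may differ from what OCRopus implements. The box format `(x0, y0, x1, y1)` with y growing downward and the function names are likewise hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def overlaps_x(a, b):
    """True if boxes a and b share some horizontal extent."""
    return a[0] < b[2] and b[0] < a[2]


def reading_order(lines):
    """Return indices of `lines` in reading order.

    Assumed rules for "a precedes b":
      1. a and b overlap horizontally and a starts above b;
      2. a lies entirely to the left of b, and no line that overlaps both
         horizontally starts vertically between them.
    """
    sorter = TopologicalSorter()
    for i in range(len(lines)):
        sorter.add(i)
    for i, a in enumerate(lines):
        for j, b in enumerate(lines):
            if i == j:
                continue
            if overlaps_x(a, b):
                if a[1] < b[1]:
                    sorter.add(j, i)  # i must come before j
            elif a[2] <= b[0]:
                low, high = min(a[1], b[1]), max(a[1], b[1])
                blocked = any(
                    overlaps_x(c, a) and overlaps_x(c, b) and low < c[1] < high
                    for c in lines
                )
                if not blocked:
                    sorter.add(j, i)
    # static_order() raises graphlib.CycleError if the rules contradict.
    return list(sorter.static_order())
```

For a two-column page, rule 1 orders the lines within each column and rule 2 places the whole left column before the right one. A line spanning both columns blocks rule 2 for lines on opposite sides of it, so text above the spanning line is read before text below it.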
The Voronoi method [7] is a bottom-up approach that
extracts sample points along the boundaries of the connected
components to construct a Voronoi point diagram, which ini-
tially creates a large number of superfluous edges. Edges are
deleted based on shortness and whether they are connected
to other lines, converting the diagram to an area Voronoi
diagram representing regions.
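To illustrate what an area Voronoi diagram represents, the sketch below assigns every background pixel to its nearest connected component and reports where neighbouring pixels belong to different components. It swaps in a different technique from the one in [7]: a breadth-first flood over the pixel grid using city-block distance, rather than a point Voronoi diagram built from boundary samples followed by edge deletion. The input format (a grid of component ids, 0 for background) and the function names are hypothetical.

```python
from collections import deque


def area_voronoi_labels(components):
    """Label every pixel with the id of the nearest component.

    `components` is a grid of ints: 0 = background, k > 0 = component id.
    Distance is city-block; ties go to whichever component reaches first.
    """
    height, width = len(components), len(components[0])
    labels = [row[:] for row in components]
    queue = deque((y, x) for y in range(height) for x in range(width)
                  if components[y][x])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < height and 0 <= nx < width and labels[ny][nx] == 0:
                labels[ny][nx] = labels[y][x]
                queue.append((ny, nx))
    return labels


def region_boundaries(labels):
    """Return pairs of 4-adjacent pixels that carry different labels."""
    height, width = len(labels), len(labels[0])
    edges = set()
    for y in range(height):
        for x in range(width):
            for ny, nx in ((y + 1, x), (y, x + 1)):
                if ny < height and nx < width and labels[ny][nx] != labels[y][x]:
                    edges.add(((y, x), (ny, nx)))
    return edges
```

The boundaries returned here separate individual connected components. A segmentation method would still need to remove most of them, for example those between closely spaced characters, so that the remaining boundaries enclose whole regions.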
This paper presents improvements to the RAST and
Voronoi segmentation algorithms found in OCRopus. Before
improvement, RAST was not able to accurately segment
and determine reading order of mixed content documents.
Voronoi tended to oversegment documents that contained
images and did not classify regions as text or non-text, so
it could not be used for OCR. While a number of methods
have been proposed for region classification [8], [9], [10],
2011 International Conference on Document Analysis and Recognition
1520-5363/11 $26.00 © 2011 IEEE



