ICDAR 2003 robust reading competitions: entries, results, and future directions
- ISSN: 14332833
- DOI: 10.1007/s10032-004-0134-3
Abstract
This paper describes the robust reading competitions for ICDAR 2003. With the rapid growth in research over the last few years on recognizing text in natural scenes, there is an urgent need to establish some common benchmark datasets and gain a clear understanding of the current state of the art. We use the term robust reading to refer to text images that are beyond the capabilities of current commercial OCR packages. We chose to break down the robust reading problem into three subproblems and run competitions for each stage, and also a competition for the best overall system. The subproblems we chose were text locating, character recognition and word recognition. By breaking down the problem in this way, we hoped to gain a better understanding of the state of the art in each of the subproblems. Furthermore, our methodology involved storing detailed results of applying each algorithm to each image in the datasets, allowing researchers to study in depth the strengths and weaknesses of each algorithm. The text-locating contest was the only one to have any entries. We give a brief description of each entry and present the results of this contest, showing cases where the leading entries succeed and fail. We also describe an algorithm for combining the outputs of the individual text locators and show how the combination scheme improves on any of the individual systems.
IJDAR (2005) 7: 105–122
Simon M. Lucas¹, Alex Panaretos¹, Luis Sosa¹, Anthony Tang¹, Shirley Wong¹, Robert Young¹,
Kazuki Ashida², Hiroki Nagai², Masayuki Okamoto², Hiroaki Yamamoto², Hidetoshi Miyao²,
JunMin Zhu³, WuWen Ou³, Christian Wolf⁴, Jean-Michel Jolion⁴, Leon Todoran⁵, Marcel Worring⁵,
Xiaofan Lin⁶
¹ Department of Computer Science, University of Essex, Colchester CO4 3SQ, UK
² Department of Information Engineering, Faculty of Engineering, Shinshu University, 4-17-1 Wakasato Nagano 380-8553, Japan
³ Institute of Automation, Chinese Academy of Science, PO Box 2738, Beijing 100080, P.R. China
⁴ Lyon Research Center for Images and Intelligent Information Systems (LIRIS), INSA de Lyon, Bât. J. Verne 20, rue Albert Einstein, 69621 Villeurbanne cedex, France
⁵ Informatics Institute, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
⁶ Hewlett-Packard Laboratories, 1501 Page Mill Road, MS 1203, Palo Alto, CA 94304, USA
Published online: June 21, 2005 – © Springer-Verlag 2005
Keywords: Reading competition – Text locating –
Camera captured
1 Introduction
Fifty years of research in machine reading systems has
seen great progress, and commercial OCR packages now
operate with high speed and accuracy on good-quality
documents. These systems are not robust, however, and
do not work well on poor-quality documents or on
camera-captured text in everyday scenes. The goal of
general-purpose reading systems with human-like speed
and accuracy remains elusive. Applications include data
archive conversion of noisy documents, textual search of
image and video databases, aids for the visually impaired
and reading systems for mobile robots.
Recent years have seen significant research into general
reading systems that are able to locate and/or read
text in video or natural scene images [7, 8, 11, 13, 32, 33].
So far, however, there have not been any standard publicly
available ground-truthed datasets, which severely
limits the conclusions which may be drawn regarding
the relative merits of each approach.
Hence, the aims of these competitions were as follows:
– To capture and ground-truth a text-in-scene dataset
of significant size. This should have a shelf-life well
beyond that of the competitions.
– To design or adopt standard formats for these
datasets, and also for the results produced by the
recognizers.
– To design or adopt standard evaluation procedures
according to current best practices.
– To run the competitions in order to get a snapshot
of the current state of the art in this area.
[Figure 1: diagram with the labels Trained Recognizer, TestData.xml, Raw.xml, Evaluator, True.xml, Details.xml, Summary.xml and Results DB]
Fig. 1. The multistage evaluation process
We aimed to broadly follow the principles and procedures
used to run the Fingerprint Verification 2000 (and
2002) competitions [16]. Well in advance of the deadline
we published sample datasets for each problem, the evaluation
software to be used, and the criteria for deciding
the winner of each contest. To enter the contests, researchers
had to submit their software to us in the form
of a ready-to-run command-line executable. The executable
takes a test-data input file and produces a raw results file. The
raw results are then compared to the ground truth for
that dataset by an evaluation algorithm, which produces
a set of detailed results and also a summary. The detailed
results report how well the algorithm worked on each image,
while the summary results report the aggregate over
all the images in the dataset. All these files are based on
simple XML formats to allow maximum compatibility
between different versions of evaluation systems, recognizers
and file formats. In particular, new attributes and
elements can be added to the markup while retaining
backward compatibility with older recognition systems.
The generic process is depicted in Fig. 1.
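The split between detailed and summary results can be illustrated with a short sketch. This is not the competition software. The function names, the toy per-image score (fraction of ground-truth rectangles matched exactly) and the use of a plain mean as the aggregate are all assumptions made for the example.

```python
from statistics import mean


def exact_match_fraction(found, expected):
    """Toy per-image score (an assumption, not the contest metric):
    the fraction of expected rectangles that were reported exactly."""
    if not expected:
        return 1.0
    return sum(1 for rect in expected if rect in found) / len(expected)


def evaluate(raw, truth, score_image):
    """Compare raw results with the ground truth, image by image.

    raw and truth map an image name to a list of (x, y, width, height)
    rectangles. Returns the per-image details and a single summary value.
    """
    details = {
        name: score_image(raw.get(name, []), expected)
        for name, expected in truth.items()
    }
    # Hypothetical aggregate: the unweighted mean over all images.
    summary = mean(list(details.values())) if details else 0.0
    return details, summary


# Made-up data standing in for the parsed ground-truth and raw result files.
truth = {
    "a.jpg": [(99, 94, 128, 20)],
    "b.jpg": [(0, 0, 10, 10), (5, 5, 20, 8)],
}
raw = {
    "a.jpg": [(99, 94, 128, 20)],
    "b.jpg": [(0, 0, 10, 10)],
}

details, summary = evaluate(raw, truth, exact_match_fraction)
print(details)
print(summary)
```

Keeping the per-image details next to the summary is what lets a later reader see on which images a given recognizer failed, rather than only its overall score.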
2 Data capture
Images were captured with a variety of digital cameras
by each of the Essex authors. The cameras were used with a
range of resolutions and other settings, chosen at the
discretion of the photographer.
To allow management of the ground-truthing or tagging
of the images, and with a view to possible future
tagging jobs, we implemented a Web-based tagging system.
This operates along similar lines to the OpenMind
concept¹. People working as taggers can log in to the system
from anywhere on the Internet using a Java (1.4)-enabled
Web browser. On logging in, a Java applet window
appears and presents a series of images. The tagger
tags each image by dragging rectangles over words and
then typing in the associated text. The applet then suggests
a possible segmentation of the word into its individual
characters, which the tagger can then adjust on a
character-by-character basis. The tagger can also adjust
the slant and rotation of the region. When the tagger
has finished an image, he clicks ‘Submit’, at which point
all the tagged rectangles are sent back to a server, where
they are stored in a database. One of the parameters of
the system is how many taggers should tag each image.
If we had a plentiful supply of tagging effort, then we
could send each image to several taggers and simply accept
all the images where the tags from different taggers
were in broad agreement. This is somewhat wasteful of
tagging effort, however, since it is much quicker to check
an image than it is to tag it. We therefore adopted a
two-tier tagging system of taggers and checkers, where
the job of a checker was to approve a set of tags.
¹ D. Stork, The Open Mind Initiative, http://www.openmind.org
There are several ways of communicating between
the applet and the server. We chose to use Simple Object
Access Protocol (SOAP) – partly to gain experience
of SOAP on a real project, and partly to allow good
interoperability with other systems. Potentially, someone
could now write a tagging application in some other language,
and still request images to tag, and upload tagged
images to our server.
Figure 2 shows a fragment of XML used to mark up
the data we captured. This sample corresponds to the
word Department in Fig. 3. The root element is tagset
and consists of a sequence of image elements – one for
each image in the dataset. The imageName element gives
the relative path to the image file, and the resolution
element gives the width and height of the image. The
taggedRectangles element contains a taggedRectangle el-
<tagset>
  <image>
    <imageName>scene/ComputerScienceSmall.jpg</imageName>
    <resolution x="338" y="255" />
    <taggedRectangles>
      <taggedRectangle x="99" y="94" width="128" height="20"
                       offset="0" rotation="0">
        <tag>Department</tag>
        <segmentation>
          <xOff>16</xOff>
          <xOff>29</xOff>
          <xOff>43</xOff>
          <xOff>54</xOff>
          <xOff>64</xOff>
          <xOff>74</xOff>
          <xOff>93</xOff>
          <xOff>106</xOff>
          <xOff>117</xOff>
        </segmentation>
      </taggedRectangle>
      ...
    </taggedRectangles>
  </image>
  ...
</tagset>
Fig. 2. A sample of our XML format for marking up the
words in images
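A file in this format could be read with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a copy of the Fig. 2 fragment, with the elided parts left out so that it parses. The function name read_tagset and the dictionary layout are hypothetical. Treating each xOff value as the horizontal position of a cut between adjacent characters is also an assumption, although it is consistent with the nine values given for the ten-letter word.

```python
import xml.etree.ElementTree as ET

# The Fig. 2 fragment, with the "..." placeholders omitted so it is well formed.
SAMPLE = """<tagset>
  <image>
    <imageName>scene/ComputerScienceSmall.jpg</imageName>
    <resolution x="338" y="255" />
    <taggedRectangles>
      <taggedRectangle x="99" y="94" width="128" height="20"
                       offset="0" rotation="0">
        <tag>Department</tag>
        <segmentation>
          <xOff>16</xOff><xOff>29</xOff><xOff>43</xOff>
          <xOff>54</xOff><xOff>64</xOff><xOff>74</xOff>
          <xOff>93</xOff><xOff>106</xOff><xOff>117</xOff>
        </segmentation>
      </taggedRectangle>
    </taggedRectangles>
  </image>
</tagset>"""


def read_tagset(xml_text):
    """Return one dictionary per image element in a tagset document."""
    root = ET.fromstring(xml_text)
    images = []
    for image in root.findall("image"):
        resolution = image.find("resolution")
        words = []
        for rect in image.findall("taggedRectangles/taggedRectangle"):
            words.append({
                "text": rect.findtext("tag"),
                # float() is used so that non-integer coordinates also parse.
                "box": tuple(float(rect.get(key))
                             for key in ("x", "y", "width", "height")),
                "offset": float(rect.get("offset")),
                "rotation": float(rect.get("rotation")),
                "cuts": [float(cut.text)
                         for cut in rect.findall("segmentation/xOff")],
            })
        images.append({
            "name": image.findtext("imageName"),
            "size": (int(resolution.get("x")), int(resolution.get("y"))),
            "words": words,
        })
    return images


images = read_tagset(SAMPLE)
word = images[0]["words"][0]
print(images[0]["name"], images[0]["size"])
print(word["text"], word["box"], len(word["cuts"]))
```

Because the parser looks elements and attributes up by name, any extra attributes or child elements added to the format later are simply ignored by this reader.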


