Towards more effective distance functions for word image matching
Proceedings of the 8th IAPR International Workshop on Document Analysis Systems DAS 10 (2010)
- ISBN: 9781605587738
- DOI: 10.1145/1815330.1815377
Towards More Effective Distance Functions for Word
Image Matching
Raman Jain
Centre for Visual Information Technology
IIIT-Hyderabad, India
ramanjain@students.iiit.ac.in
C. V. Jawahar
Centre for Visual Information Technology
IIIT-Hyderabad, India
jawahar@iiit.ac.in
ABSTRACT
Matching word images has many applications in document recognition
and retrieval systems. Dynamic Time Warping (DTW) is popularly used
to estimate the similarity between word images. Word images are
represented as sequences of feature vectors, and the cost associated
with dynamic programming based alignment is taken as the
dissimilarity between them. However, such approaches are
computationally costly when compared to fixed-length matching
schemes. In this paper, we explore systematic methods for
identifying appropriate distance metrics for a given database or
language. This is achieved by learning query-specific distance
functions which can be computed online efficiently. We show that a
weighted Euclidean distance can outperform DTW for matching word
images. This class of distance functions is also well suited to
scalable, large-scale matching. Our results are validated with mean
Average Precision (mAP) on a fully annotated data set of 160K word
images. We then show that the learnt distance functions can even be
extended to a new database to obtain accurate retrieval.
1. INTRODUCTION
Matching two word images by computing an appropriate similarity
measure has many applications in document analysis systems
[3, 18, 22]. These include accessing historic handwritten
manuscripts [16, 21], searching for relevant documents in a digital
library of printed documents [3], holistic recognition [14], and
enhancing OCR accuracy by post-processing the classification
results [11, 19]. In this paper we aim at learning effective
similarity measures that are specific to word images. We limit our
scope to matching printed word images. Though our approaches are
demonstrated on English, our methods are language independent.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
DAS ’10, June 9-11, 2010, Boston, MA, USA
Copyright 2010 ACM 978-1-60558-773-8/10/06 ...$10.00
Though words can be matched by comparing holistic features [15],
the popular approach for matching has been aligning sequences of
feature vectors using Dynamic Time Warping (DTW) [17, 20]. A sliding
window (or a vertical strip) is moved from left to right and
features are computed for each window. This results in a sequence
of feature vectors. To compute the distance between two such
sequences, they are first aligned using dynamic programming. The
cost of the alignment is treated as the distance between the two
sequences. Matching sequences of lengths M and N requires O(MN)
operations. To retrieve from a large data set, the query has to be
matched with every word in the database, so for large-scale
matching and retrieval this becomes the bottleneck. Replacing DTW
by Euclidean distance (possibly at the cost of accuracy) has been a
step towards scalability [13]. This can result in constant-time
(i.e., O(d)) retrieval.
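The cost difference between the two matching schemes can be made
concrete with a minimal sketch. It assumes each word image has
already been converted either to a sequence of per-window feature
vectors (for DTW) or to a single fixed-length vector (for Euclidean
distance). The function names, the Euclidean local cost and the
unconstrained three-way recurrence are illustrative assumptions, not
necessarily the exact formulation used in the experiments.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW alignment cost between sequences a (M x d) and b (N x d).

    Fills an (M+1) x (N+1) table, so the work is O(MN) local
    comparisons. The local cost is assumed to be Euclidean.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    m, n = len(a), len(b)
    # cost[i, j] = cheapest alignment of a[:i] with b[:j]
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(cost[i - 1, j],      # skip in b
                                     cost[i, j - 1],      # skip in a
                                     cost[i - 1, j - 1])  # match
    return float(cost[m, n])

def euclidean_distance(x, y):
    """Distance between two fixed-length feature vectors: one pass."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.linalg.norm(x - y))
```

In this sketch DTW absorbs a repeated frame at no cost, for example
when one word image is slightly wider than the other, whereas the
Euclidean distance requires both inputs to have the same length.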
Computing similarity with a simple Euclidean distance does not use
the knowledge that the comparisons are made between two word
images. While computing the distance between two feature vectors
(whether they are built from sequences or holistic features), we
need to note that the individual features need not be uncorrelated.
Most comparison techniques also do not use the fact that the words
are generated from a language model. This could impose further
constraints on the feature vectors that can be generated from words
in a given language. In this paper, we explore distance functions
which can be learnt from examples. Our objective is to capitalize
on the fact that not all possible combinations of characters are
valid in a language, so that a distance function learnt from
training examples can actually benefit matching and retrieval on a
test (unseen or unannotated) data set. We validate our claims on
the annotated and unannotated data sets presented in Section 2.
Since we have a reasonably large annotated corpus for comparison,
we use mean average precision (area under the precision-recall
curve) or mean precision for statistically validating the approach.
We start our experiments by comparing DTW and Euclidean distance in
Section 3. We observe that it is not reasonable to conclude that
DTW is always superior to a fixed-length representation. This gives
us hope for building similarity measures that are both efficient
and effective. We then show how a specific weighted Euclidean
distance can perform better in a given setting (say, when the set
of possible queries is known a priori). We design a query-specific
classifier (QSC), which is obtained by learning the weights (the
parameters associated with the distance function) on a “training”
data set
(Section 4). However, this method is restrictive. We then extend
QSC by systematically extrapolating the learnt weights to obtain
the weights for a new (or unseen) query. We demonstrate the
performance of our method on a large corpus of more than five
million words.
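As an illustration of the kind of distance function discussed
above, the following sketch assumes the weighted Euclidean distance
takes the diagonal form d_w(x, y) = sqrt(sum_i w_i (x_i - y_i)^2),
with one non-negative weight per feature dimension. The weight
vector w stands in for whatever is learnt for a given query; how it
is learnt is not shown, and the function names are hypothetical.

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance with per-dimension weights w.

    Assumes a diagonal weighting; w is a placeholder for weights
    learnt for a particular query.
    """
    x, y, w = (np.asarray(v, dtype=float) for v in (x, y, w))
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))

def rank_database(query, database, w):
    """Indices of database rows sorted by increasing weighted distance.

    database is an (n x d) array of fixed-length feature vectors.
    Squared distances are enough for ranking, so the square root is
    skipped.
    """
    database = np.asarray(database, dtype=float)
    diffs = database - np.asarray(query, dtype=float)
    squared = (diffs ** 2) @ np.asarray(w, dtype=float)
    return np.argsort(squared, kind="stable")
```

With all weights equal to one this reduces to the plain Euclidean
distance. Setting a weight to zero makes the ranking ignore that
feature, which is one way a query-specific weighting can change the
retrieved order while the cost per comparison stays linear in the
feature dimension.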
2. DATASETS AND EXPERIMENTAL SETTING
We first summarize the experimental framework we use throughout
the paper. We consider three different types of data sets in
English. They are aimed at quantitatively evaluating the
performance of the matching methods, as well as demonstrating the
generalization capabilities of the learning schemes. The data sets
can be summarized as:
• Calibrated Data (CD): To study the effect of font and size
variations, we consider a calibrated data set of word images. They
are generated by rendering the text and passing it through a
document degradation model [27]. The intensity of degradation is
characterized by a scalar, which is used for computing the
probability of degradation for the boundary pixels. We consider two
subsets of this data set, CD1 and CD2. The set CD1 consists of 1000
words in multiple fonts and sizes. All images are equivalent to
words typeset in 8pt to 15pt and scanned at 300 dpi. CD2 is similar
to CD1 but has a higher degree of degradation.
• Real Annotated Data (RD): This data set consists of a set of
words with their ground truth (text) associated with them. It is
built from 765 pages of scanned books, publicly available on the
web [1]. All the words in these pages are manually annotated for
conducting experiments and evaluating performance. There are a
total of 162,188 word images in this collection, drawn from four
books which vary in fonts.
• Unannotated Data (UD): In addition to the completely annotated
data sets mentioned above, we also consider a data set of 5,870,486
words which come from scanned books. Since this data set has no
ground truth, it can be used primarily to evaluate the precision
for selected queries and not the recall.
Examples from these data sets are shown in Figure 2. The examples
demonstrate how the same word appears in each of the three data
sets. Note that the three data sets do not contain the same set of
words, since this depends on the content of the books; the sets do,
however, overlap. We use these data sets for designing more
effective similarity scores for the comparison of word images. We
use these similarity scores for retrieving similar word images. We
evaluate the retrieval performance with measures such as (i)
Precision, (ii) Recall, (iii) F-Score, (iv) AP, and (v) mean of AP.
Most of our results are presented using mean of AP.
Figure 1: PR-curve on two different queries from the RD dataset.
Figure 2: A comparison of words from three databases. Words in the
first column are from CD1, words in the second column are from RD,
and words in the third column are from UD.
Precision is the ratio of the number of relevant images retrieved
to the total number of images retrieved, for a particular query. It
measures how well a system discards irrelevant results while
retrieving. Recall is the ratio of the number of relevant images
retrieved to the total number of relevant images present in the
database. It measures how well a system finds what the user wants.
By changing a matching threshold, one can typically increase recall
at the cost of precision. A precision-recall (PR) curve plots how
the variation in one affects the other. F-Score is the weighted
harmonic mean of precision and recall; it combines the two, each
measured in isolation, into one score. Average precision (AP)
measures the area under the precision-recall curve. Average
precision makes use of both recall and precision and encourages the
relevant results to appear at higher ranked positions. The mean of
the APs computed for multiple queries gives us the mean average
precision (mAP) [25]. We plot the precision-recall graphs of two
words (i.e., “even” and “think”) in Figure 1. (These are computed
on RD, as discussed later.) It may be seen that the AP can be
significantly different for different words. The area under the
curve for “even” is 0.502 and that for “think” is 0.928. Also note
that, with a given feature set, not all words are equally easy (or
difficult) to retrieve. A probable conclusion based on this is that
the matching scheme (or distance measure) we use need not be the
same for all words.
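The measures defined above can be sketched as follows. The sketch
assumes each retrieval result is given as a list of booleans in
rank order (True where the retrieved image is relevant), and it
approximates the area under the PR curve by the usual
non-interpolated average of the precision values at the ranks of
the relevant items. The function names and this particular
approximation are assumptions made for illustration.

```python
def precision_recall_f(retrieved, total_relevant, beta=1.0):
    """Precision, recall and F-score for one retrieved list.

    retrieved: booleans, True where a retrieved image is relevant.
    total_relevant: number of relevant images in the database.
    beta: weight of recall relative to precision (1.0 is balanced).
    """
    hits = sum(retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score

def average_precision(ranked, total_relevant):
    """Non-interpolated AP: mean precision at each relevant rank."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(ranked, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / total_relevant if total_relevant else 0.0

def mean_average_precision(per_query):
    """mAP over (ranked, total_relevant) pairs, one per query."""
    aps = [average_precision(r, n) for r, n in per_query]
    return sum(aps) / len(aps) if aps else 0.0
```

Because each relevant item contributes the precision at its own
rank, moving a relevant image up the ranked list raises AP, which
matches the remark that AP encourages relevant results to appear at
higher ranked positions.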