Document ink bleed-through removal with two hidden Markov random fields and a single observation field.
- PubMed: 20075470
Abstract
We present a new method for blind document bleed-through removal based on separate Markov Random Field (MRF) regularization for the recto and for the verso side, where separate priors are derived from the full graph. The segmentation algorithm is based on Bayesian Maximum a Posteriori (MAP) estimation. The advantages of this separate approach are the adaptation of the prior to the content creation process (e.g., superimposing two handwritten pages), and the improvement of the estimation of the recto pixels through an estimation of the verso pixels covered by recto pixels; moreover, the formulation as a binary labeling problem with two hidden labels per pixel naturally leads to an efficient optimization method based on the minimum cut/maximum flow in a graph. The proposed method is evaluated on scanned document images from the 18th century, showing an improvement of character recognition results compared to other restoration methods.
Christian Wolf
Index Terms—Markov random fields, Bayesian estimation, graph cuts, document image restoration.
1 INTRODUCTION
General image restoration methods that do not deal with document image analysis have mostly been
designed to cope with sensor noise, quantization noise,
and optical degradations such as blur and defocusing (see [31]
for a survey). Document images, however, are often
additionally subject to further and stronger degradations:
1. nonstationary noise due to illumination changes;
2. curvature of the document;
3. ink and coffee stains and holes in the document;
4. ink bleed-through: the appearance of the verso-side
text or graphics on the scanned image of the recto
side. This is an important problem when very old
historical documents are processed;
5. low print contrast;
6. errors in the alignment of multiple printing or
imaging stages.
In this paper, we concentrate on ink bleed-through
removal, i.e., the separation of a single scanned document
image into a recto side and a verso side. We assume that a
scan of the verso side is not available (blind separation). In
this case, the task is equivalent to a segmentation problem:
classify each pixel as recto, verso, background, or
possibly recto-and-verso (simultaneously). This makes the
vast collection of widely known segmentation techniques
immediately available. However, document images are
a specific type of images with their own properties and
their own specific problems.
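The four pixel classes listed above can be read as the combinations of two binary labels per pixel, one for each side of the page. The following is a minimal sketch of that reading; the names `PixelClass` and `pixel_class` are hypothetical and chosen only for illustration, not taken from the paper.

```python
from enum import Enum


class PixelClass(Enum):
    """The four segmentation classes named in the text."""
    BACKGROUND = "background"
    RECTO = "recto"
    VERSO = "verso"
    RECTO_AND_VERSO = "recto-and-verso"


def pixel_class(recto_ink: bool, verso_ink: bool) -> PixelClass:
    """Map two binary labels (ink on recto, ink on verso) to one class.

    Assumption: each side carries an independent binary label, so the
    four classes are exactly the four label combinations.
    """
    if recto_ink and verso_ink:
        return PixelClass.RECTO_AND_VERSO
    if recto_ink:
        return PixelClass.RECTO
    if verso_ink:
        return PixelClass.VERSO
    return PixelClass.BACKGROUND


if __name__ == "__main__":
    for recto in (False, True):
        for verso in (False, True):
            print(recto, verso, pixel_class(recto, verso).value)
```

Under this reading, a verso label can still be set at a pixel that is covered by recto ink, which is the case the abstract refers to as verso pixels covered by recto pixels.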
At first thought, it might be a good idea to interpret the task
as a blind source separation problem similar to the “cocktail
party” problems successfully dealt with by the (audio) signal
processing community. The widely used technique of independent
components analysis (ICA) has been applied to document
bleed-through removal, mainly by Tonazzini and
Bedini [34]. However, ICA assumes a linear model, which
makes this formulation questionable: $d_s = A f_s$, where $d_s$ is
the observation vector, $f_s$ is the source vector, and $A$ is the
mixing matrix. The source vectors, corresponding to the
pixels at sites $s$, are mostly chosen to be three-dimensional:
the recto signal, the verso signal, and an additional signal
adding the background color [34]. In this case, the column
vectors of the mixing matrix become the color vectors for,
respectively, recto pixels, verso pixels, and background
pixels, as can be seen by setting $f_s = [1\ 0\ 0]^T$, $[0\ 1\ 0]^T$, and
$[0\ 0\ 1]^T$, setting $d_s$ to the respective color vector, and solving for $A$.
We can easily verify that the linear hypothesis cannot be
justified for ink bleed-through by calculating the color of an
observed pixel created by a source pixel that contains
overlapping recto and verso pixels ($f_s = [1\ 1\ 0]^T$): the model
predicts the sum of the color vectors for the recto and the verso pixels,
which is unlikely.
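The argument can be checked numerically with a small sketch. The RGB values below are hypothetical, chosen only to illustrate the point (they are not measurements from the paper), and `mix` is an illustrative helper that evaluates the linear model for one pixel.

```python
# Hypothetical RGB colors in [0, 1]; illustrative values only.
RECTO = (0.20, 0.15, 0.10)       # dark ink on the recto side
VERSO = (0.60, 0.50, 0.40)       # fainter ink bleeding through from the verso
BACKGROUND = (0.90, 0.85, 0.70)  # paper color

# Mixing matrix A: its columns are the recto, verso, and background colors.
A = [[RECTO[c], VERSO[c], BACKGROUND[c]] for c in range(3)]


def mix(A, f):
    """Return the observation d = A f for a source vector f."""
    return tuple(sum(a * x for a, x in zip(row, f)) for row in A)


recto_only = mix(A, (1, 0, 0))  # reproduces the recto color (first column)
verso_only = mix(A, (0, 1, 0))  # reproduces the verso color (second column)
overlap = mix(A, (1, 1, 0))     # linear model: recto color + verso color

if __name__ == "__main__":
    print("recto only:", recto_only)
    print("verso only:", verso_only)
    print("overlap   :", overlap)
```

With these assumed colors, the overlapping pixel predicted by the linear model is lighter than recto ink alone in every channel, whereas one would expect ink on top of bleed-through to look at least as dark as the recto ink.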
The same authors present a nonblind technique also
applicable to grayscale images [37], the different compo-
nents corresponding to the recto scan and the verso scan.
The inverse of the mixing matrix is calculated using
orthogonalization justified by several assumptions on the
degradation process. In [35], the same authors introduce a
double MRF model similar to our proposition combined
with a likelihood term consisting of a linear mixing model.
However, whereas our graphical model is directly employed
for classification, the MRF in [35] guides an
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. X, XXXXXXX 2010 1
The author is with the Université de Lyon, CNRS, and the INSA-Lyon,
LIRIS UMR 5205, Bât. Jules Verne 20, Avenue Albert Einstein, F-69621
Villeurbanne cedex, France. E-mail: christian.wolf@liris.cnrs.fr.
Manuscript received 2 July 2007; revised 17 July 2008; accepted 26 Jan. 2009;
published online 29 Jan. 2009.
Recommended for acceptance by D. Lopresti.
For information on obtaining reprints of this article, please send e-mail to:
tpami@computer.org, and reference IEEECS Log Number
TPAMI-2007-07-0399.
Digital Object Identifier no. 10.1109/TPAMI.2009.33.
0162-8828/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.