Generalizing Inverse Compositional and ESM Image Alignment
International Journal of Computer Vision (2009)
- ISSN: 09205691
- ISBN: 0769525210
- DOI: 10.1007/s11263-009-0263-8
Available from www.springerlink.com
or
Author-supplied keywords
Page 1
Generalizing Inverse Compositional and ESM Image Alignment
Int J Comput Vis (2010) 87: 191–212
DOI 10.1007/s11263-009-0263-8
Generalizing Inverse Compositional and ESM Image Alignment
Rupert Brooks · Tal Arbel
Received: 9 June 2008 / Accepted: 10 June 2009 / Published online: 7 July 2009
© Springer Science+Business Media, LLC 2009
Abstract Inverse compositional (IC) image alignment (Ba-
ker and Matthews in Int. J. Comput. Vis. 56(3):221–255,
2004) uses the symmetry between the roles of the fixed
and moving images for faster processing. However, it re-
quires implementation of compositional optimizer update
steps. The IC approach can be viewed as an efficient way of
computing the similarity measure derivative relative to the
fixed image warp parameters. Since the mapping between
the fixed and moving warp parameters is continuous and dif-
ferentiable, this derivative can be converted into the moving
warp space using the chain rule. This avoids the need for
compositional update steps. Our generalization also allows
the efficient second order method (ESM) (Malis in Proceed-
ings of the 2004 IEEE International Conference on Robot-
ics and Automation (ICRA04), pp. 1843–1848, 2004; Ben-
himane and Malis in IEEE/RSJ International Conference on
Intelligent Robots and Systems, 2004; Malis and Benhimane
in Robot. Auton. Syst. 52(1):39–52, 2005) to be applied to
general parameterizations of the transformation.
Experiments using multiple similarity measures and op-
timizers show that our generalized IC method equals or ex-
ceeds the performance of the original IC approach. The gen-
eralized ESM approach is more reliable than the classic ap-
proach as it increases the capture radius of the optimization.
R. Brooks (
)
Industrial Materials Institute, National Research Council of
Canada, Boucherville, Quebec, Canada
e-mail: rupert.brooks@imi.cnrc-nrc.gc.ca
T. Arbel
McGill Centre for Intelligent Machines, McGill University,
Montreal, Quebec, Canada
e-mail: arbel@cim.mcgill.ca
Keywords Image alignment · Image registration · Inverse
compositional method · Efficient second order method
1 Introduction
Image alignment, or registration, is a computer vision prob-
lem of great practical importance (Brown 1992; Szeliski
2004; Zitova and Flusser 2003). Direct, or pixel based, ap-
proaches work by defining a difference measure between
the images to be registered. One image referred to as the
fixed image is used as a reference, and the other, referred
to as the moving image is warped to match it. Non-linear
optimization methods are used to find the set of warp pa-
rameters that minimize the difference between the fixed and
warped moving image. This can be very time-consuming for
large datasets such as high-resolution images, or 3D images.
Many applications, such as real-time tracking or registration
of patient imagery in image guided surgery require computa-
tionally efficient approaches. Computational efficiency can
be achieved by either speeding up the evaluation of this dif-
ference measure, or by evaluating it fewer times during the
optimization process.
The image registration problem is usually formulated be-
tween a pair of images, although these remarks can be gen-
eralized to registration of a group of images. For the pair-
wise case, one image is held fixed, and the other is warped
by the transformation. However, we can easily imagine in-
terchanging the roles of the images. Instead of warping the
moving image, the reference image could be inverse warped
by an equivalent amount. Another special characteristic of
this problem is that the optimization step can be done ei-
ther additively, by adding an update to the warp parame-
ters, or compositionally, by composing the warp parame-
ters. A compositional step is equivalent to warping the image
DOI 10.1007/s11263-009-0263-8
Generalizing Inverse Compositional and ESM Image Alignment
Rupert Brooks · Tal Arbel
Received: 9 June 2008 / Accepted: 10 June 2009 / Published online: 7 July 2009
© Springer Science+Business Media, LLC 2009
Abstract Inverse compositional (IC) image alignment (Ba-
ker and Matthews in Int. J. Comput. Vis. 56(3):221–255,
2004) uses the symmetry between the roles of the fixed
and moving images for faster processing. However, it re-
quires implementation of compositional optimizer update
steps. The IC approach can be viewed as an efficient way of
computing the similarity measure derivative relative to the
fixed image warp parameters. Since the mapping between
the fixed and moving warp parameters is continuous and dif-
ferentiable, this derivative can be converted into the moving
warp space using the chain rule. This avoids the need for
compositional update steps. Our generalization also allows
the efficient second order method (ESM) (Malis in Proceed-
ings of the 2004 IEEE International Conference on Robot-
ics and Automation (ICRA04), pp. 1843–1848, 2004; Ben-
himane and Malis in IEEE/RSJ International Conference on
Intelligent Robots and Systems, 2004; Malis and Benhimane
in Robot. Auton. Syst. 52(1):39–52, 2005) to be applied to
general parameterizations of the transformation.
Experiments using multiple similarity measures and op-
timizers show that our generalized IC method equals or ex-
ceeds the performance of the original IC approach. The gen-
eralized ESM approach is more reliable than the classic ap-
proach as it increases the capture radius of the optimization.
R. Brooks (
)
Industrial Materials Institute, National Research Council of
Canada, Boucherville, Quebec, Canada
e-mail: rupert.brooks@imi.cnrc-nrc.gc.ca
T. Arbel
McGill Centre for Intelligent Machines, McGill University,
Montreal, Quebec, Canada
e-mail: arbel@cim.mcgill.ca
Keywords Image alignment · Image registration · Inverse
compositional method · Efficient second order method
1 Introduction
Image alignment, or registration, is a computer vision prob-
lem of great practical importance (Brown 1992; Szeliski
2004; Zitova and Flusser 2003). Direct, or pixel based, ap-
proaches work by defining a difference measure between
the images to be registered. One image referred to as the
fixed image is used as a reference, and the other, referred
to as the moving image is warped to match it. Non-linear
optimization methods are used to find the set of warp pa-
rameters that minimize the difference between the fixed and
warped moving image. This can be very time-consuming for
large datasets such as high-resolution images, or 3D images.
Many applications, such as real-time tracking or registration
of patient imagery in image guided surgery require computa-
tionally efficient approaches. Computational efficiency can
be achieved by either speeding up the evaluation of this dif-
ference measure, or by evaluating it fewer times during the
optimization process.
The image registration problem is usually formulated be-
tween a pair of images, although these remarks can be gen-
eralized to registration of a group of images. For the pair-
wise case, one image is held fixed, and the other is warped
by the transformation. However, we can easily imagine in-
terchanging the roles of the images. Instead of warping the
moving image, the reference image could be inverse warped
by an equivalent amount. Another special characteristic of
this problem is that the optimization step can be done ei-
ther additively, by adding an update to the warp parame-
ters, or compositionally, by composing the warp parame-
ters. A compositional step is equivalent to warping the image
Page 2
192 Int J Comput Vis (2010) 87: 191–212
with one set of parameters, and then warping the result with
another set. These possibilities were investigated by Baker
and Matthews (2004) where the efficient inverse composi-
tional (IC) method is proposed.
The IC method achieves its efficiency by precomputing a
number of components of the calculation, thus speeding up
the calculation of the derivatives of the difference measure.
In this terminology, the usual approach of adding an update
step computed from the moving image, is called the for-
wards additive approach. More recently, these symmetries
were exploited in (Malis 2004; Benhimane and Malis 2004;
Malis and Benhimane 2005) to develop the Efficient Sec-
ond order Minimization (ESM) method. In this approach,
forward and inverse calculations of the derivatives are com-
bined. Efficiency is achieved through more rapid optimizer
convergence, thus requiring fewer calculations of the differ-
ence measure in total. A similar method has also been pro-
posed independently in Keller and Averbuch (2004).
The fundamental advantage of these approaches is that
the derivatives of the image difference measure can be com-
puted more efficiently or accurately by exploiting the sym-
metry in the image registration problem. However, all pro-
posed algorithms have imposed special requirements on the
optimization algorithm, and sometimes on the transforma-
tion itself. These requirements make the implementation
more complex. This also makes it more difficult to change
the optimization approach if needed for particular problems
or to adopt new developments from the field of optimiza-
tion. To date, most implementations have relied heavily on
Gauss-Newton optimization and therefore on least squares
type image difference measures, which are not applicable
to multimodal images. The ESM method also requires that
the transformations be parameterized using the exponential
map of the transformation group. We propose a generaliza-
tion that eliminates the need for specialized optimizers and
transformations. In this way, we hope to make the advan-
tages of these methods more widely applicable.
Multimodal images are frequently registered using the
mutual information cost function, which can capture rela-
tionships between image intensities that other measures can-
not (Pluim et al. 2003). Recently Dowson and Bowden have
proposed an inverse compositional mutual information al-
gorithm called MILK (Mutual Information Lucas Kanade)
(Dowson and Bowden 2008). Unfortunately this approach
still requires a specialized optimizer implementation. Chan-
ging the optimization technique would require reformulat-
ing the entire algorithm. In (Brooks and Arbel 2006) a
method was proposed to generalize the IC method by main-
taining a mapping between the additive update space of the
optimizer, and the compositional update space of the regis-
tration problem. The method did not require a compositional
optimizer update step, but still required rather intricate inter-
action with the optimization algorithm.
In this paper, we show that the IC approach to image
alignment can be viewed as a different parameterization of
the moving image transformation. Thus derivatives compu-
ted relative to one parameterization, can be converted to der-
ivatives relative to another parameterization by using the
familiar chain rule. This allows us to generalize these ef-
ficient image registration methods to a much wider range
of cost functions and optimizers. The following section be-
gins by reviewing approaches to image alignment, in par-
ticular the IC and ESM approaches. We then develop our
proposed generalization in detail in Sect. 3. Our implemen-
tation of this proposal is described in Sect. 4. To test the
algorithm, image registration experiments using three dif-
ferent image difference measures (mean squared difference,
normalized correlation, and mutual information), and three
different types of optimization approach (gradient descent,
quasi-Newton and Newton-Raphson) have been performed.
The experiments, which are described in Sect. 5, include
both simulated and real registration problems on both dig-
ital photographs and medical images. The results show that
our generalized IC approach is more efficient than the clas-
sical, forwards additive, approach and gives equivalent or
better accuracy and reliability when compared to the orig-
inal IC approach. Our generalization also applies to ESM,
where comparisons indicate that it has a better capture ra-
dius which leads to greater reliability when the registration
is started far from the correct answer.
2 Background
2.1 Direct Image Alignment
Direct image alignment (Brown 1992; Szeliski 2004; Zi-
tova and Flusser 2003) is most commonly formulated be-
tween a pair of images. One image, referred to as the fixed
image, If (x), is used as a reference while the other, re-
ferred to as the moving image, Im(x), is warped to match
the first. Each image is defined as a function over some
set of coordinates, x ∈ Rd . The warp, W(x,φ), is a map-
ping from Rd → Rd which is parameterized by φ. To solve
the alignment problem, a difference measure between the
two images is defined, D(If , Im(W(x,φ))). The best align-
ment, φopt , is considered to be the one that minimizes the
difference between the first image and the warped second
image. This can be expressed as an unconstrained non-
linear optimization problem (Brown 1992; Szeliski 2004;
Zitova and Flusser 2003):
φopt = argmin
φ
(D(If (x), Im(W(x,φ)))). (1)
Since If is constant, this can be simplified to the problem of
optimizing D(φ). To avoid becoming trapped in local min-
with one set of parameters, and then warping the result with
another set. These possibilities were investigated by Baker
and Matthews (2004) where the efficient inverse composi-
tional (IC) method is proposed.
The IC method achieves its efficiency by precomputing a
number of components of the calculation, thus speeding up
the calculation of the derivatives of the difference measure.
In this terminology, the usual approach of adding an update
step computed from the moving image, is called the for-
wards additive approach. More recently, these symmetries
were exploited in (Malis 2004; Benhimane and Malis 2004;
Malis and Benhimane 2005) to develop the Efficient Sec-
ond order Minimization (ESM) method. In this approach,
forward and inverse calculations of the derivatives are com-
bined. Efficiency is achieved through more rapid optimizer
convergence, thus requiring fewer calculations of the differ-
ence measure in total. A similar method has also been pro-
posed independently in Keller and Averbuch (2004).
The fundamental advantage of these approaches is that
the derivatives of the image difference measure can be com-
puted more efficiently or accurately by exploiting the sym-
metry in the image registration problem. However, all pro-
posed algorithms have imposed special requirements on the
optimization algorithm, and sometimes on the transforma-
tion itself. These requirements make the implementation
more complex. This also makes it more difficult to change
the optimization approach if needed for particular problems
or to adopt new developments from the field of optimiza-
tion. To date, most implementations have relied heavily on
Gauss-Newton optimization and therefore on least squares
type image difference measures, which are not applicable
to multimodal images. The ESM method also requires that
the transformations be parameterized using the exponential
map of the transformation group. We propose a generaliza-
tion that eliminates the need for specialized optimizers and
transformations. In this way, we hope to make the advan-
tages of these methods more widely applicable.
Multimodal images are frequently registered using the
mutual information cost function, which can capture rela-
tionships between image intensities that other measures can-
not (Pluim et al. 2003). Recently Dowson and Bowden have
proposed an inverse compositional mutual information al-
gorithm called MILK (Mutual Information Lucas Kanade)
(Dowson and Bowden 2008). Unfortunately this approach
still requires a specialized optimizer implementation. Chan-
ging the optimization technique would require reformulat-
ing the entire algorithm. In (Brooks and Arbel 2006) a
method was proposed to generalize the IC method by main-
taining a mapping between the additive update space of the
optimizer, and the compositional update space of the regis-
tration problem. The method did not require a compositional
optimizer update step, but still required rather intricate inter-
action with the optimization algorithm.
In this paper, we show that the IC approach to image
alignment can be viewed as a different parameterization of
the moving image transformation. Thus derivatives compu-
ted relative to one parameterization, can be converted to der-
ivatives relative to another parameterization by using the
familiar chain rule. This allows us to generalize these ef-
ficient image registration methods to a much wider range
of cost functions and optimizers. The following section be-
gins by reviewing approaches to image alignment, in par-
ticular the IC and ESM approaches. We then develop our
proposed generalization in detail in Sect. 3. Our implemen-
tation of this proposal is described in Sect. 4. To test the
algorithm, image registration experiments using three dif-
ferent image difference measures (mean squared difference,
normalized correlation, and mutual information), and three
different types of optimization approach (gradient descent,
quasi-Newton and Newton-Raphson) have been performed.
The experiments, which are described in Sect. 5, include
both simulated and real registration problems on both dig-
ital photographs and medical images. The results show that
our generalized IC approach is more efficient than the clas-
sical, forwards additive, approach and gives equivalent or
better accuracy and reliability when compared to the orig-
inal IC approach. Our generalization also applies to ESM,
where comparisons indicate that it has a better capture ra-
dius which leads to greater reliability when the registration
is started far from the correct answer.
2 Background
2.1 Direct Image Alignment
Direct image alignment (Brown 1992; Szeliski 2004; Zi-
tova and Flusser 2003) is most commonly formulated be-
tween a pair of images. One image, referred to as the fixed
image, If (x), is used as a reference while the other, re-
ferred to as the moving image, Im(x), is warped to match
the first. Each image is defined as a function over some
set of coordinates, x ∈ Rd . The warp, W(x,φ), is a map-
ping from Rd → Rd which is parameterized by φ. To solve
the alignment problem, a difference measure between the
two images is defined, D(If , Im(W(x,φ))). The best align-
ment, φopt , is considered to be the one that minimizes the
difference between the first image and the warped second
image. This can be expressed as an unconstrained non-
linear optimization problem (Brown 1992; Szeliski 2004;
Zitova and Flusser 2003):
φopt = argmin
φ
(D(If (x), Im(W(x,φ)))). (1)
Since If is constant, this can be simplified to the problem of
optimizing D(φ). To avoid becoming trapped in local min-
Page 3
Int J Comput Vis (2010) 87: 191–212 193
Algorithm 1 Classical (forwards additive) Image Alignment
Set start position φ(0); iteration counter n = 0
Repeat
Compute D(φ(n)), ∇φD(φ(n)), and/or HφD(φ(n)) as
needed by the specific algorithm
Compute update step φ(n) from the results of the pre-
vious step.
Set φ(n+1) = φ(n) + φ(n) and n = n + 1
Until convergence criteria reached
ima, the problem is usually solved using a multiscale ap-
proach (Szeliski 2004; Zitova and Flusser 2003). A scale-
space pyramid is formed by repeatedly reducing the images
in size. The optimization is first performed at the smallest
level of the pyramid. The result of this optimization is used
as the starting position for an optimization at the next level,
and so on.
There are many different algorithms for unconstrained
nonlinear optimization (Gill et al. 1981; Fletcher 1987), but
most that have been applied to this problem have the fol-
lowing overall structure: Starting from some initial estimate,
φ(0), they compute the value of the function to be optimized,
D(φ), and possibly its gradient, ∇φD(φ) and/or its Hessian,
HφD(φ). These are then used to determine a step in para-
meter space which is added to the current parameters. The
algorithm iterates until some stopping criteria are met. This
is shown in Algorithm 1. In this and the algorithms that fol-
low, we examine the algorithm for one level of a multiscale
pyramid. In the real implementation, the process is repeated
for each level of the pyramid.
In general, the update step is computed based on a model
of the objective function computed from the values seen so
far (illustrated in Fig. 1). Algorithms differ mainly in how
they compute this model. For example, Newton-Raphson
methods use the Hessian of the cost function, HφD(φ) to
compute a quadratic model, which is then minimized using
φ = −[HφD(φ)]−1 · ∇φD(φ). The true Hessian matrix
is often difficult to compute, and numerous approximations
have been developed.
The special case when the objective function is a mean
of squared differences (MSD) arises very often. In this case,
a Gauss-Newton approximation to the Hessian can be used.
This was first applied to image registration by Lucas and
Kanade (1981), and image registration using this approach
is often called the Lucas-Kanade algorithm. In this context,
the sum of squared pixel differences is used; specifically,
DMSD(φ) =
1
N
N
∑
i=1
[
Ifi − Imi (φ)
]2
. (2)
The gradient of DMSD is
∇φDMSD(φ) =
2
N
JT
[
If − Im(φ)
]
Fig. 1 In general, a nonlinear optimization algorithm minimizes an
approximate model of the cost function at each step. Here the level
curves of the true objective function are shown in solid lines, those of
the approximation in dotted lines. The update step (black arrow) will
minimize the approximate model. A new model will then be computed
at the next iteration
Fig. 2 The derivatives of each moving image pixel with respect to the
transformation parameters give rise to a hyperplane in image space.
The Gauss-Newton point is the closest point to the fixed image on that
hyperplane. Note that the hyperplane has the same number of dimen-
sions as the transformation parameters, and the image space in which
it is embedded has the dimensionality of the number of pixels (Simard
et al. 2000)
where J = ∇φIm(φ). The Hessian of this cost function is
HφDMSD(φ) =
2
N
JT J +
2
N
[
If − Im(φm)
]
HφIm(φ). (3)
The Gauss-Newton approximation assumes that the second
term in Eq. 3 is small and ignores it. The update step is then:
φm(n) = −(J
T J )−1J T
[
If − Im(φ(n))
]
. (4)
The Gauss-Newton approach has an elegant visual interpre-
tation (Simard et al. 2000) (Fig. 2). The derivatives of the
Algorithm 1 Classical (forwards additive) Image Alignment
Set start position φ(0); iteration counter n = 0
Repeat
Compute D(φ(n)), ∇φD(φ(n)), and/or HφD(φ(n)) as
needed by the specific algorithm
Compute update step φ(n) from the results of the pre-
vious step.
Set φ(n+1) = φ(n) + φ(n) and n = n + 1
Until convergence criteria reached
ima, the problem is usually solved using a multiscale ap-
proach (Szeliski 2004; Zitova and Flusser 2003). A scale-
space pyramid is formed by repeatedly reducing the images
in size. The optimization is first performed at the smallest
level of the pyramid. The result of this optimization is used
as the starting position for an optimization at the next level,
and so on.
There are many different algorithms for unconstrained
nonlinear optimization (Gill et al. 1981; Fletcher 1987), but
most that have been applied to this problem have the fol-
lowing overall structure: Starting from some initial estimate,
φ(0), they compute the value of the function to be optimized,
D(φ), and possibly its gradient, ∇φD(φ) and/or its Hessian,
HφD(φ). These are then used to determine a step in para-
meter space which is added to the current parameters. The
algorithm iterates until some stopping criteria are met. This
is shown in Algorithm 1. In this and the algorithms that fol-
low, we examine the algorithm for one level of a multiscale
pyramid. In the real implementation, the process is repeated
for each level of the pyramid.
In general, the update step is computed based on a model
of the objective function computed from the values seen so
far (illustrated in Fig. 1). Algorithms differ mainly in how
they compute this model. For example, Newton-Raphson
methods use the Hessian of the cost function, HφD(φ) to
compute a quadratic model, which is then minimized using
φ = −[HφD(φ)]−1 · ∇φD(φ). The true Hessian matrix
is often difficult to compute, and numerous approximations
have been developed.
The special case when the objective function is a mean
of squared differences (MSD) arises very often. In this case,
a Gauss-Newton approximation to the Hessian can be used.
This was first applied to image registration by Lucas and
Kanade (1981), and image registration using this approach
is often called the Lucas-Kanade algorithm. In this context,
the sum of squared pixel differences is used; specifically,
DMSD(φ) =
1
N
N
∑
i=1
[
Ifi − Imi (φ)
]2
. (2)
The gradient of DMSD is
∇φDMSD(φ) =
2
N
JT
[
If − Im(φ)
]
Fig. 1 In general, a nonlinear optimization algorithm minimizes an
approximate model of the cost function at each step. Here the level
curves of the true objective function are shown in solid lines, those of
the approximation in dotted lines. The update step (black arrow) will
minimize the approximate model. A new model will then be computed
at the next iteration
Fig. 2 The derivatives of each moving image pixel with respect to the
transformation parameters give rise to a hyperplane in image space.
The Gauss-Newton point is the closest point to the fixed image on that
hyperplane. Note that the hyperplane has the same number of dimen-
sions as the transformation parameters, and the image space in which
it is embedded has the dimensionality of the number of pixels (Simard
et al. 2000)
where J = ∇φIm(φ). The Hessian of this cost function is
HφDMSD(φ) =
2
N
JT J +
2
N
[
If − Im(φm)
]
HφIm(φ). (3)
The Gauss-Newton approximation assumes that the second
term in Eq. 3 is small and ignores it. The update step is then:
φm(n) = −(J
T J )−1J T
[
If − Im(φ(n))
]
. (4)
The Gauss-Newton approach has an elegant visual interpre-
tation (Simard et al. 2000) (Fig. 2). The derivatives of the
Page 4
194 Int J Comput Vis (2010) 87: 191–212
pixel values with respect to the transformation parameters
give rise to a hyperplane passing through the moving image
in the space of possible images. The Gauss-Newton step is
the point on that hyperplane closest (in Euclidean distance)
to the fixed image (Simard et al. 2000).
While the Gauss-Newton approximation is very widely
used, it is not always appropriate. If it is not practical to
compute or approximate the Hessian, other methods must
be used. These include simple gradient descent, where a
step is taken in the direction of the gradient at each itera-
tion, and Quasi-Newton methods, which gradually build up
an approximation to the Hessian from the derivatives of the
objective function. Different optimization methods are ap-
propriate for different registration problems. As the inverse
compositional and efficient second order methods require
modifications of the optimization algorithm, specific imple-
mentations would have to be developed for each different
case, which is undesirable.
2.2 Special Characteristics of the Image Registration
Problem
The image registration problem has two special characteris-
tics which have allowed the development of efficiency im-
provements. First, a warp of the moving image can be con-
sidered equivalent to an inverse warp of the fixed image.
Therefore computing a warp for the fixed image is just as
valid as warping the moving image. Thus we can generalize
the image registration problem to:
[
φfopt
φmopt
]
= argmin
φf ,φm
(D(If (W(x,φf )), Im(W(x,φm)))),
(5)
where φf is the parameter vector of a warp of the fixed im-
age. This optimization has many solutions, as there are an
infinite number of possible pairs of φf ,φm that are concep-
tually equivalent. (In practice, however, due to sampling and
interpolation effects, actually warping the fixed as well as
the moving image results would give slightly different φm
for each possible φf . In our approach, as in all the inverse
compositional approaches, φf is left at the identity warp.)
Second, most warps that are of interest can be both com-
posed and inverted. This means that we can consider com-
bining the update step with the current warp parameters by
either adding the parameters, or by composing the warps.
Thus, D(φf ,φm) is equivalent to D(φm ◦ [φf ]−1). In a
slight abuse of notation, we will write φ−1 to mean the pa-
rameters of the transformation that is the inverse of W(·,φ)
and φ1 ◦ φ0 to mean the composition of the transformations
W(·,φ1) and W(·,φ0). Specifically
W(W(x,φ),φ−1) = x; ∀x ∈ Rd
and,
W(x,φ1 ◦ φ0) = W(W(x,φ0),φ1); ∀x ∈ Rd .
Clearly, these rely on composition and inversion being valid
operations on warps—that is—that the warps form a group.
However, it has been shown in Matthews and Baker (2004)
that, even with approximate inverse and composition op-
erators, these relationships will work. The different varia-
tions in the direct image alignment solution that this spe-
cial structure gives rise to have been analyzed in detail in
Baker and Matthews (2004), where the inverse composi-
tional method is proposed. In the following section, we de-
scribe this method and the difficulties that arise in applying
it to certain image registration problems.
2.3 Inverse Compositional Method
Baker and Matthews (2004) proposed the inverse compo-
sitional (IC) method as a faster algorithm for image align-
ment. They note that an update step φf for the fixed image
can be computed efficiently when φf is kept at the identity
warp. Because of the equivalence between the transforma-
tion of the fixed and moving images, this update of φf is
equivalent to updating φm using
φm(n+1) = φm(n) ◦
[
φ−1f(n)
]
.
This computes the update to φm by inverting and composing
the update of φf with the current φm (see Algorithm 2).
The IC method is significantly faster than the forward ad-
ditive method because the search is always performed ar-
ound the zero warp of the fixed image, and therefore the
Jacobian of the fixed image with respect to the parameters,
Jf = ∇φf If (φf ), and the Gauss-Newton approximation
to the Hessian, Hφf D(φf ,φm) = J Tf Jf , may be precom-
puted. This idea has been applied in several contexts, includ-
ing active appearance models (Matthews and Baker 2004),
and 3D medical data (Andreopoulos and Tsotsos 2005;
Algorithm 2 Inverse Compositional Image Alignment
Set start position φm0 ; φf = 0; iteration counter n = 0
Compute the derivative of the fixed image with respect to
the warp parameters, J Tf = ∇φf If (φf )
Repeat
Compute D(φf ,φm(n)), ∇φf D(φf ,φm(n)), and/or
Hφf D(φf ,φm(n)) as needed by the specific optimiza-
tion technique.
Compute update step φf from the results of the pre-
vious step.
Set φm(n+1) = φm(n) ◦ φ
−1
f(n)
and n = n + 1
Until convergence criteria reached
pixel values with respect to the transformation parameters
give rise to a hyperplane passing through the moving image
in the space of possible images. The Gauss-Newton step is
the point on that hyperplane closest (in Euclidean distance)
to the fixed image (Simard et al. 2000).
While the Gauss-Newton approximation is very widely
used, it is not always appropriate. If it is not practical to
compute or approximate the Hessian, other methods must
be used. These include simple gradient descent, where a
step is taken in the direction of the gradient at each itera-
tion, and Quasi-Newton methods, which gradually build up
an approximation to the Hessian from the derivatives of the
objective function. Different optimization methods are ap-
propriate for different registration problems. As the inverse
compositional and efficient second order methods require
modifications of the optimization algorithm, specific imple-
mentations would have to be developed for each different
case, which is undesirable.
2.2 Special Characteristics of the Image Registration
Problem
The image registration problem has two special characteris-
tics which have allowed the development of efficiency im-
provements. First, a warp of the moving image can be con-
sidered equivalent to an inverse warp of the fixed image.
Therefore computing a warp for the fixed image is just as
valid as warping the moving image. Thus we can generalize
the image registration problem to:
[
φfopt
φmopt
]
= argmin
φf ,φm
(D(If (W(x,φf )), Im(W(x,φm)))),
(5)
where φf is the parameter vector of a warp of the fixed im-
age. This optimization has many solutions, as there are an
infinite number of possible pairs of φf ,φm that are concep-
tually equivalent. (In practice, however, due to sampling and
interpolation effects, actually warping the fixed as well as
the moving image results would give slightly different φm
for each possible φf . In our approach, as in all the inverse
compositional approaches, φf is left at the identity warp.)
Second, most warps that are of interest can be both com-
posed and inverted. This means that we can consider com-
bining the update step with the current warp parameters by
either adding the parameters, or by composing the warps.
Thus, D(φf ,φm) is equivalent to D(φm ◦ [φf ]−1). In a
slight abuse of notation, we will write φ−1 to mean the pa-
rameters of the transformation that is the inverse of W(·,φ)
and φ1 ◦ φ0 to mean the composition of the transformations
W(·,φ1) and W(·,φ0). Specifically
W(W(x,φ),φ−1) = x; ∀x ∈ Rd
and,
W(x,φ1 ◦ φ0) = W(W(x,φ0),φ1); ∀x ∈ Rd .
Clearly, these rely on composition and inversion being valid
operations on warps—that is—that the warps form a group.
However, it has been shown in Matthews and Baker (2004)
that, even with approximate inverse and composition op-
erators, these relationships will work. The different varia-
tions in the direct image alignment solution that this spe-
cial structure gives rise to have been analyzed in detail in
Baker and Matthews (2004), where the inverse composi-
tional method is proposed. In the following section, we de-
scribe this method and the difficulties that arise in applying
it to certain image registration problems.
2.3 Inverse Compositional Method
Baker and Matthews (2004) proposed the inverse compo-
sitional (IC) method as a faster algorithm for image align-
ment. They note that an update step φf for the fixed image
can be computed efficiently when φf is kept at the identity
warp. Because of the equivalence between the transforma-
tion of the fixed and moving images, this update of φf is
equivalent to updating φm using
φm(n+1) = φm(n) ◦
[
φ−1f(n)
]
.
This computes the update to φm by inverting and composing
the update of φf with the current φm (see Algorithm 2).
The IC method is significantly faster than the forward ad-
ditive method because the search is always performed ar-
ound the zero warp of the fixed image, and therefore the
Jacobian of the fixed image with respect to the parameters,
Jf = ∇φf If (φf ), and the Gauss-Newton approximation
to the Hessian, Hφf D(φf ,φm) = J Tf Jf , may be precom-
puted. This idea has been applied in several contexts, includ-
ing active appearance models (Matthews and Baker 2004),
and 3D medical data (Andreopoulos and Tsotsos 2005;
Algorithm 2 Inverse Compositional Image Alignment
Set start position φm0 ; φf = 0; iteration counter n = 0
Compute the derivative of the fixed image with respect to
the warp parameters, J Tf = ∇φf If (φf )
Repeat
Compute D(φf ,φm(n)), ∇φf D(φf ,φm(n)), and/or
Hφf D(φf ,φm(n)) as needed by the specific optimiza-
tion technique.
Compute update step φf from the results of the pre-
vious step.
Set φm(n+1) = φm(n) ◦ φ
−1
f(n)
and n = n + 1
Until convergence criteria reached
Page 5
Int J Comput Vis (2010) 87: 191–212 195
Baker et al. 2004). Extensions to the method have incor-
porated intensity changes into the cost function (Bartoli
2006) and application of the approach to mutual information
(Dowson and Bowden 2008; Brooks and Arbel 2006). How-
ever, in all cases, the update step must be compositional.
This poses a problem because most widely available opti-
mization methods are additive. Thus using the IC method re-
quires modifications to the optimization algorithm. In previ-
ous work (Brooks and Arbel 2006), we extended the method
to general optimizers and cost functions, but this extension
still required careful interaction with the optimizer update
step. The advantages of the IC method all relate to how effi-
ciently it allows computation of ∇φD, and the requirements
for changes to the optimization algorithm only increase the
difficulty of its implementation, and limit its applicability.
2.4 Efficient Second Order Minimization
Efficient Second order Minimization (ESM) was proposed
by Benhimane and Malis (2004), Malis (2004), Malis and
Benhimane (2005) for tracking and visual servoing. This
method does not speed up the computation of D, In fact, it
may slow it down as more computation is done at each step.
Instead, it achieves its efficiency by reducing the number of
optimization steps required to find the optimum.
The ESM method (Algorithm 3) is based on the observa-
tion that due to the symmetry of the image registration prob-
lem, the Gauss-Newton approximation to the Hessian could
be calculated with respect to either φf or φm. The update
step from each would be slightly different. They show that
the approximate Hessian given by the average of the two ap-
proaches is a superior approximation to the one from either
approach alone. Thus they use an update step of
φm(n) = −(
¯J T ¯J )−1 ¯J T
[
If − Im(φm(n))
] (6)
where ¯J is,
¯J =
Jm − Jf
2
= ∇φmIm(φm) − ∇φf If (φf ), (7)
the average of the Jacobians computed with respect to ei-
ther set of transform parameters. A similar insight has been
Algorithm 3 Efficient Second Order Image Alignment
Set start position φm0 ; φf = 0; iteration counter n = 0
Compute Jf = ∇φf If (φf )
Repeat
Compute D(φf ,φm(n)), Jm = ∇φmIm(φm),
∇φf D(φf ,φm(n)), and ∇φmD(φf ,φm(n)).
Compute ¯J using Eq. 7.
Compute update step φm(n) based on Eq. 6
Set φm(n+1) = φm(n) ◦ φm(n) and n = n + 1
Until convergence criteria reached
proposed in Keller and Averbuch (2004), without addressing
the generality of Eq. 7.
As originally presented, the ESM method could only be
applied where a Gauss-Newton approximation is being used.
However, we consider that the deeper insight of the ESM
method is that the update steps computed using the fixed
image, and the one using the moving image are not identi-
cal, and that the combination of both may prove superior to
either one.
A more serious limitation is that, as discussed in (Malis
2004; Benhimane and Malis 2004; Malis and Benhimane
2005), the ESM method can only be applied in very specific
circumstances. For it to work, combining the two Jacobians
using Eq. 7 must be a meaningful thing to do. This implies
that the warp must have the property that φ−1 = −φ. This
is not true in general. However, it is true in the neighbor-
hood of the identity warp, when the warp forms a Lie group
and is parameterized using its exponential map (Malis and
Benhimane 2005). Thus, they must express the warp using
that parameterization, and compute the J ’s for both the fixed
and moving image around the identity warp. This also re-
quires the use of the forward compositional approach for
Jm. This limits the applicability of this approach. It may be
undesirable to use the exponential map parameterization, as
it requires an evaluation of a matrix exponential, which can
be difficult (Moler and Van Loan 2003). Furthermore, while
moving along the exponential map is a geodesic on the trans-
formation manifold, it generally induces a relatively curved
path in the image space, which can make the optimization
more difficult.
In the following section we show that the symmetry in
the image registration problem means that the moving im-
age warp can be considered as a smooth function of the fixed
image warp. As a result, gradients with respect to one set of
parameters can be converted to gradients with respect to the
other. This allows us to generalize both methods to use ad-
ditive optimizers, and any parameterization of the transfor-
mations that allows composition and inversion.
3 Generalized Approach
The IC and ESM methods for efficient image alignment have
special characteristics which make their widespread adop-
tion difficult. The IC approach maps update steps from the
space of φf to the space of φm using inversion and compo-
sition. This requires customization of any optimization algo-
rithm intended to be used. The ESM approach combines the
forward and inverse Gauss-Newton steps, but to do so relies
on the transformation having the property that φ−1 = −φ.
This is undesirable for similar reasons: such parameteriza-
tions are not be universally available, and may present im-
plementation difficulties.
Baker et al. 2004). Extensions to the method have incor-
porated intensity changes into the cost function (Bartoli
2006) and application of the approach to mutual information
(Dowson and Bowden 2008; Brooks and Arbel 2006). How-
ever, in all cases, the update step must be compositional.
This poses a problem because most widely available opti-
mization methods are additive. Thus using the IC method re-
quires modifications to the optimization algorithm. In previ-
ous work (Brooks and Arbel 2006), we extended the method
to general optimizers and cost functions, but this extension
still required careful interaction with the optimizer update
step. The advantages of the IC method all relate to how effi-
ciently it allows computation of ∇φD, and the requirements
for changes to the optimization algorithm only increase the
difficulty of its implementation, and limit its applicability.
2.4 Efficient Second Order Minimization
Efficient Second order Minimization (ESM) was proposed
by Benhimane and Malis (2004), Malis (2004), Malis and
Benhimane (2005) for tracking and visual servoing. This
method does not speed up the computation of D, In fact, it
may slow it down as more computation is done at each step.
Instead, it achieves its efficiency by reducing the number of
optimization steps required to find the optimum.
The ESM method (Algorithm 3) is based on the observa-
tion that due to the symmetry of the image registration prob-
lem, the Gauss-Newton approximation to the Hessian could
be calculated with respect to either φf or φm. The update
step from each would be slightly different. They show that
the approximate Hessian given by the average of the two ap-
proaches is a superior approximation to the one from either
approach alone. Thus they use an update step of
φm(n) = −(
¯J T ¯J )−1 ¯J T
[
If − Im(φm(n))
] (6)
where ¯J is,
¯J =
Jm − Jf
2
= ∇φmIm(φm) − ∇φf If (φf ), (7)
the average of the Jacobians computed with respect to ei-
ther set of transform parameters. A similar insight has been
Algorithm 3 Efficient Second Order Image Alignment
Set start position φm0 ; φf = 0; iteration counter n = 0
Compute Jf = ∇φf If (φf )
Repeat
Compute D(φf ,φm(n)), Jm = ∇φmIm(φm),
∇φf D(φf ,φm(n)), and ∇φmD(φf ,φm(n)).
Compute ¯J using Eq. 7.
Compute update step φm(n) based on Eq. 6
Set φm(n+1) = φm(n) ◦ φm(n) and n = n + 1
Until convergence criteria reached
proposed in Keller and Averbuch (2004), without addressing
the generality of Eq. 7.
As originally presented, the ESM method could only be
applied where a Gauss-Newton approximation is being used.
However, we consider that the deeper insight of the ESM
method is that the update steps computed using the fixed
image, and the one using the moving image are not identi-
cal, and that the combination of both may prove superior to
either one.
A more serious limitation is that, as discussed in (Malis
2004; Benhimane and Malis 2004; Malis and Benhimane
2005), the ESM method can only be applied in very specific
circumstances. For it to work, combining the two Jacobians
using Eq. 7 must be a meaningful thing to do. This implies
that the warp must have the property that φ−1 = −φ. This
is not true in general. However, it is true in the neighbor-
hood of the identity warp, when the warp forms a Lie group
and is parameterized using its exponential map (Malis and
Benhimane 2005). Thus, they must express the warp using
that parameterization, and compute the J ’s for both the fixed
and moving image around the identity warp. This also re-
quires the use of the forward compositional approach for
Jm. This limits the applicability of this approach. It may be
undesirable to use the exponential map parameterization, as
it requires an evaluation of a matrix exponential, which can
be difficult (Moler and Van Loan 2003). Furthermore, while
moving along the exponential map is a geodesic on the trans-
formation manifold, it generally induces a relatively curved
path in the image space, which can make the optimization
more difficult.
In the following section we show that the symmetry in
the image registration problem means that the moving im-
age warp can be considered as a smooth function of the fixed
image warp. As a result, gradients with respect to one set of
parameters can be converted to gradients with respect to the
other. This allows us to generalize both methods to use ad-
ditive optimizers, and any parameterization of the transfor-
mations that allows composition and inversion.
3 Generalized Approach
The IC and ESM methods for efficient image alignment have
special characteristics which make their widespread adop-
tion difficult. The IC approach maps update steps from the
space of φf to the space of φm using inversion and compo-
sition. This requires customization of any optimization algo-
rithm intended to be used. The ESM approach combines the
forward and inverse Gauss-Newton steps, but to do so relies
on the transformation having the property that φ−1 = −φ.
This is undesirable for similar reasons: such parameteriza-
tions are not be universally available, and may present im-
plementation difficulties.
Page 6
196 Int J Comput Vis (2010) 87: 191–212
Note that both the IC and ESM approaches rely on the
special equivalence that exists in the pairwise image regis-
tration problem. Specifically,
D(If (W(x,φf )), Im(W(x,φm))), (8)
is equivalent to
D(If (x), Im(W(W(x,φ
−1
f ),φm))). (9)
When the composition and inversion operations are
smooth and differentiable—something that is implied al-
ready by the existing requirements to use either of these
methods—then there is a one-to-one differentiable mapping
between warps of the fixed image and warps of the moving
image. At a particular step n, we can express the moving
image warp as a function of the fixed image warp:
φm(n+1) (φf(n) ) = φm(n) ◦ φ
−1
f(n)
,
which reparameterizes the cost function in Eq. 1 in terms of
φf : D(If (x), Im(W(x,φm(n)(φf(n) )))).
In the inverse compositional method we have an efficient
way of computing ∂D∂φf , but standard optimization methods
require the derivative with respect to φm, ∂D∂φm . Expressing
φf as a function of φm, and applying the chain rule, we get
∂D
∂φm
=
∂D
∂φf
·
∂φf
∂φm
. (10)
The Gauss-Newton approximation to the Hessian is the
sum of the outer products of the gradient of each pixel with
respect to the transformation parameters.
Hφf D(φf ) J
T
f Jf =
∑
i
∂Ifi
∂φf
T
·
∂Ifi
∂φf
, (11)
where Ifi is pixel i of If . Each of these partial derivatives
can be converted using the chain rule as above,
∂Imi
∂φm
=
∂Ifi
∂φf
·
∂φf
∂φm
,
giving,
HφmD(φf ,φm) ≈
∑
i
∂φf
∂φm
T ∂Ifi
∂φf
T
·
∂Ifi
∂φf
∂φf
∂φm
,
which simplifies to
HφmD(φf ,φm) ≈
∂φf
∂φm
T
∑
i
[
∂Ifi
∂φf
T
·
∂Ifi
∂φf
]
∂φf
∂φm
. (12)
Thus, it is possible to convert the gradient and Hessian of
the image difference measure with respect to the fixed im-
age warp parameters to be derivatives with respect to the
Algorithm 4 Generalized Inverse Compositional Image
Alignment
Set start position φm(0); φf = 0; iteration counter n = 0
Compute Jf = ∇φf If (φf )
Repeat
Compute D(φf ,φm(n)), ∇φf D(φf ,φm(n)) and
Hφf D(φf ,φm(n)) as required
Convert ∇φf D(φf ,φm(n)) to ∇φmD(φf ,φm(n)) and
Hφf D(φf ,φm) to HφmD(φf ,φm(n)) using Eqs. 10, 12
Compute update step φm using the chosen optimiza-
tion method.
Set φm(n+1) = φm(n) + φm(n) and n = n + 1. Note that
this is an additive step.
Until convergence criteria reached
moving image warp parameters as long as we can compute
the Jacobian matrix, ∂φf∂φm .
3.1 Generalized Inverse Compositional Alignment
Our proposal for generalizing the IC algorithm is to use
these methods to convert the derivative and Gauss-Newton
approximation to the Hessian from being with respect to the
fixed image warp parameters to being with respect to the
moving image warp parameters. These converted derivatives
can be provided to any standard optimization technique,
without requiring any special modifications of the optimiza-
tion algorithm. In particular, the update step is now additive,
which permits using standard additive optimizers. The re-
sulting generalized algorithm is shown in Algorithm 4.
One may ask how this generalization will affect the steps
taken by the optimization process. The steps taken by this
algorithm will differ from the original IC algorithm. The
original algorithm—when not using a single Gauss-New-
ton step—moves along a straight line in the φf space. This
line is generally curved if it is converted into the φm space.
The generalized approach moves along a straight line in φm
space. This line is tangent to the curve produced by the orig-
inal algorithm (see Fig. 3).
3.2 Generalized ESM
The ESM method can be also be generalized using Eqs. 10–
12. Any derivative with respect to φf can be converted to
the space of φm before averaging it with the corresponding
derivative computed relative to φm.1 It is no longer neces-
sary to restrict the parameterization so that Eq. 7 is valid:
1However, the name “Efficient Second Order Method” may become a
misnomer when it is generalized in this way, as it is only a second order
approximation when the Jacobian pseudoinverse is being used.
Note that both the IC and ESM approaches rely on the
special equivalence that exists in the pairwise image regis-
tration problem. Specifically,
D(If (W(x,φf )), Im(W(x,φm))), (8)
is equivalent to
D(If (x), Im(W(W(x,φ
−1
f ),φm))). (9)
When the composition and inversion operations are
smooth and differentiable—something that is implied al-
ready by the existing requirements to use either of these
methods—then there is a one-to-one differentiable mapping
between warps of the fixed image and warps of the moving
image. At a particular step n, we can express the moving
image warp as a function of the fixed image warp:
φm(n+1) (φf(n) ) = φm(n) ◦ φ
−1
f(n)
,
which reparameterizes the cost function in Eq. 1 in terms of
φf : D(If (x), Im(W(x,φm(n)(φf(n) )))).
In the inverse compositional method we have an efficient
way of computing ∂D∂φf , but standard optimization methods
require the derivative with respect to φm, ∂D∂φm . Expressing
φf as a function of φm, and applying the chain rule, we get
∂D
∂φm
=
∂D
∂φf
·
∂φf
∂φm
. (10)
The Gauss-Newton approximation to the Hessian is the
sum of the outer products of the gradient of each pixel with
respect to the transformation parameters.
Hφf D(φf ) J
T
f Jf =
∑
i
∂Ifi
∂φf
T
·
∂Ifi
∂φf
, (11)
where Ifi is pixel i of If . Each of these partial derivatives
can be converted using the chain rule as above,
∂Imi
∂φm
=
∂Ifi
∂φf
·
∂φf
∂φm
,
giving,
HφmD(φf ,φm) ≈
∑
i
∂φf
∂φm
T ∂Ifi
∂φf
T
·
∂Ifi
∂φf
∂φf
∂φm
,
which simplifies to
HφmD(φf ,φm) ≈
∂φf
∂φm
T
∑
i
[
∂Ifi
∂φf
T
·
∂Ifi
∂φf
]
∂φf
∂φm
. (12)
Thus, it is possible to convert the gradient and Hessian of
the image difference measure with respect to the fixed im-
age warp parameters to be derivatives with respect to the
Algorithm 4 Generalized Inverse Compositional Image
Alignment
Set start position φm(0); φf = 0; iteration counter n = 0
Compute Jf = ∇φf If (φf )
Repeat
Compute D(φf ,φm(n)), ∇φf D(φf ,φm(n)) and
Hφf D(φf ,φm(n)) as required
Convert ∇φf D(φf ,φm(n)) to ∇φmD(φf ,φm(n)) and
Hφf D(φf ,φm) to HφmD(φf ,φm(n)) using Eqs. 10, 12
Compute update step φm using the chosen optimiza-
tion method.
Set φm(n+1) = φm(n) + φm(n) and n = n + 1. Note that
this is an additive step.
Until convergence criteria reached
moving image warp parameters as long as we can compute
the Jacobian matrix, ∂φf∂φm .
3.1 Generalized Inverse Compositional Alignment
Our proposal for generalizing the IC algorithm is to use
these methods to convert the derivative and Gauss-Newton
approximation to the Hessian from being with respect to the
fixed image warp parameters to being with respect to the
moving image warp parameters. These converted derivatives
can be provided to any standard optimization technique,
without requiring any special modifications of the optimiza-
tion algorithm. In particular, the update step is now additive,
which permits using standard additive optimizers. The re-
sulting generalized algorithm is shown in Algorithm 4.
One may ask how this generalization will affect the steps
taken by the optimization process. The steps taken by this
algorithm will differ from the original IC algorithm. The
original algorithm—when not using a single Gauss-New-
ton step—moves along a straight line in the φf space. This
line is generally curved if it is converted into the φm space.
The generalized approach moves along a straight line in φm
space. This line is tangent to the curve produced by the orig-
inal algorithm (see Fig. 3).
3.2 Generalized ESM
The ESM method can be also be generalized using Eqs. 10–
12. Any derivative with respect to φf can be converted to
the space of φm before averaging it with the corresponding
derivative computed relative to φm.1 It is no longer neces-
sary to restrict the parameterization so that Eq. 7 is valid:
1However, the name “Efficient Second Order Method” may become a
misnomer when it is generalized in this way, as it is only a second order
approximation when the Jacobian pseudoinverse is being used.
Page 7
Int J Comput Vis (2010) 87: 191–212 197
Fig. 3 The cost function with respect to the fixed image parame-
ters (left) and the moving image parameters (right) is related with a
non-linear mapping. Thus a linear movement (A) of the fixed image pa-
rameters, φf , can be transformed to a (generally curved) movement (B)
of the moving image parameters, φm. With our proposed method, it is
the tangent to the curve at its beginning that is mapped, and the motion
is linear (C) in the moving image parameter space
Algorithm 5 Generalized “ESM” Image Alignment
Set start position φm0 ; φf = 0; iteration counter n = 0
Compute Jf = ∇φf If (φf )
Repeat
Compute D(φf ,φm(n)), ∇φm(n) D(φf ,φm(n)),
∇φf D(φf ,φm(n)), ∇φmD(φf ,φm(n)) and Jm(φm(n)) as
required
Convert Jf to the space of Jm using Eq. 12 and aver-
age with Jm to form ¯J .
Compute compute update step φm(n) using the cho-
sen optimization algorithm.
Set φm(n+1) = φm(n) + φ
−1
m(n) and n = n + 1
Until convergence criteria reached
any parameterization where the IC method will work is suit-
able. This leads to a generalized ESM algorithm shown in
Algorithm 5.
It is more difficult to compare the steps taken by the
generalized ESM algorithm to the ones taken by the orig-
inal algorithm. In the exponential map based parameteri-
zations used by the original implementation (Malis 2004;
Benhimane and Malis 2004; Malis and Benhimane 2005),
∂φf
∂φm
= −I and the steps are identical to those taken by Al-
gorithm 3. In the more general cases to which it can now be
applied the different parameterizations give rise to different
derivatives, different hyperplane approximations (in Fig. 2)
and thus different steps. The importance of this generaliza-
tion is that the ESM method can now be applied in a much
wider set of contexts than before.
The ∂φf∂φm matrix is key to the implementation of both
methods. Appendix A contains analytical evaluations of this
matrix for 2D homographies, and 3D rigid transformations.
The generalized approaches rely on the calculation of the
∂φf
∂φm
matrix at each iteration, followed by a matrix-vector
multiplication in the case of the derivative, and two ma-
trix multiplications in the case of the approximate Hessian.
These operations are of order O(n3), where n is the num-
ber of parameters. For typical image alignment problems
in 2 and 3 dimensions, the warp typically has tens of pa-
rameters while the images typically have thousands of pix-
els. Thus these operations are not the computational bottle-
neck in the process. In addition, the original IC method in-
cluded an inversion, and a composition of the warp parame-
ters, which are operations of equivalent complexity. There-
fore, our generalized methods retain the advantages of the
IC and ESM methods, which are, respectively, faster and
more accurate calculation of the update step. The general-
ized methods avoid the restrictions on optimization tech-
nique and transform parameterization that affected the origi-
nal methods, without incurring a greater computational cost.
4 Implementation
To assess the effectiveness of our generalized approach, we
have implemented a standard multiscale image registration
framework using the ITK library (Ibanez et al. 2005). Each
generalized approach described above has been implement-
ed, along with the classical, forwards additive method. We
have implemented each of these for three widely used image
similarity measures (MSD, NCC, and MI), and three widely
used optimizers (gradient descent, BFGS, and Newton-Ra-
phson). For comparison purposes, we have also implement-
ed an original IC method which takes steps compositionally
in the parameter space. For the case of the mean squared
metric and the Newton-Raphson optimizer the original IC
method we have implemented is nearly identical to the me-
thod described in (Baker and Matthews 2004), which pro-
vides a meaningful base-line comparison for our work. For
the other similarity measure and optimizer contexts, our
original IC implementation is not precisely comparable to
other published work, as we use a different optimization
strategy compared to (Brooks and Arbel 2006) and a differ-
ent MI formulation than (Dowson and Bowden 2008). Nev-
ertheless, it provides an interesting check that our general-
ization has not lost critical elements of the method.
4.1 Image Difference Measures
To use an image difference measure in an IC or ESM context
it must be possible to compute the derivative and Hessian
with respect to φf rather than φm. For the experimental
component of this work we have implemented three im-
age difference measures that are in wide use, specifically
the Mean Squared Difference (MSD) measure (Lucas and
Kanade 1981), the Normalized Cross-Correlation (NCC)
measure (Anuta 1969) and the Mutual Information (MI)
measure (Viola and Wells 1995). Mean squared difference
Fig. 3 The cost function with respect to the fixed image parame-
ters (left) and the moving image parameters (right) is related with a
non-linear mapping. Thus a linear movement (A) of the fixed image pa-
rameters, φf , can be transformed to a (generally curved) movement (B)
of the moving image parameters, φm. With our proposed method, it is
the tangent to the curve at its beginning that is mapped, and the motion
is linear (C) in the moving image parameter space
Algorithm 5 Generalized “ESM” Image Alignment
Set start position φm0 ; φf = 0; iteration counter n = 0
Compute Jf = ∇φf If (φf )
Repeat
Compute D(φf ,φm(n)), ∇φm(n) D(φf ,φm(n)),
∇φf D(φf ,φm(n)), ∇φmD(φf ,φm(n)) and Jm(φm(n)) as
required
Convert Jf to the space of Jm using Eq. 12 and aver-
age with Jm to form ¯J .
Compute compute update step φm(n) using the cho-
sen optimization algorithm.
Set φm(n+1) = φm(n) + φ
−1
m(n) and n = n + 1
Until convergence criteria reached
any parameterization where the IC method will work is suit-
able. This leads to a generalized ESM algorithm shown in
Algorithm 5.
It is more difficult to compare the steps taken by the
generalized ESM algorithm to the ones taken by the orig-
inal algorithm. In the exponential map based parameteri-
zations used by the original implementation (Malis 2004;
Benhimane and Malis 2004; Malis and Benhimane 2005),
∂φf
∂φm
= −I and the steps are identical to those taken by Al-
gorithm 3. In the more general cases to which it can now be
applied the different parameterizations give rise to different
derivatives, different hyperplane approximations (in Fig. 2)
and thus different steps. The importance of this generaliza-
tion is that the ESM method can now be applied in a much
wider set of contexts than before.
The ∂φf∂φm matrix is key to the implementation of both
methods. Appendix A contains analytical evaluations of this
matrix for 2D homographies, and 3D rigid transformations.
The generalized approaches rely on the calculation of the
∂φf
∂φm
matrix at each iteration, followed by a matrix-vector
multiplication in the case of the derivative, and two ma-
trix multiplications in the case of the approximate Hessian.
These operations are of order O(n3), where n is the num-
ber of parameters. For typical image alignment problems
in 2 and 3 dimensions, the warp typically has tens of pa-
rameters while the images typically have thousands of pix-
els. Thus these operations are not the computational bottle-
neck in the process. In addition, the original IC method in-
cluded an inversion, and a composition of the warp parame-
ters, which are operations of equivalent complexity. There-
fore, our generalized methods retain the advantages of the
IC and ESM methods, which are, respectively, faster and
more accurate calculation of the update step. The general-
ized methods avoid the restrictions on optimization tech-
nique and transform parameterization that affected the origi-
nal methods, without incurring a greater computational cost.
4 Implementation
To assess the effectiveness of our generalized approach, we
have implemented a standard multiscale image registration
framework using the ITK library (Ibanez et al. 2005). Each
generalized approach described above has been implement-
ed, along with the classical, forwards additive method. We
have implemented each of these for three widely used image
similarity measures (MSD, NCC, and MI), and three widely
used optimizers (gradient descent, BFGS, and Newton-Ra-
phson). For comparison purposes, we have also implement-
ed an original IC method which takes steps compositionally
in the parameter space. For the case of the mean squared
metric and the Newton-Raphson optimizer the original IC
method we have implemented is nearly identical to the me-
thod described in (Baker and Matthews 2004), which pro-
vides a meaningful base-line comparison for our work. For
the other similarity measure and optimizer contexts, our
original IC implementation is not precisely comparable to
other published work, as we use a different optimization
strategy compared to (Brooks and Arbel 2006) and a differ-
ent MI formulation than (Dowson and Bowden 2008). Nev-
ertheless, it provides an interesting check that our general-
ization has not lost critical elements of the method.
4.1 Image Difference Measures
To use an image difference measure in an IC or ESM context
it must be possible to compute the derivative and Hessian
with respect to φf rather than φm. For the experimental
component of this work we have implemented three im-
age difference measures that are in wide use, specifically
the Mean Squared Difference (MSD) measure (Lucas and
Kanade 1981), the Normalized Cross-Correlation (NCC)
measure (Anuta 1969) and the Mutual Information (MI)
measure (Viola and Wells 1995). Mean squared difference
Page 8
198 Int J Comput Vis (2010) 87: 191–212
is an appropriate measure to use when the corresponding in-
tensities in each image are expected to be identical. Normal-
ized correlation is appropriate for comparing images where
there is a linear relationship between the intensities in the
fixed and moving images, such as photographs under differ-
ent lighting conditions or satellite images in different spec-
tral bands. Both of these measures are widely known and
their derivatives are straightforward. Most importantly, the
roles of the fixed and moving images are totally symmetric;
thus, obtaining ∂D∂φf and
∂2D
∂φ2f
is easily done by reversing the
roles of If and Im in the expressions for each measure. We
have therefore chosen to place the full expressions, and the
derivation of their gradients and approximation Hessians, in
Appendix B.
4.1.1 Mutual Information
While NCC relies on the image intensities being correlated
when they are in proper alignment, MI is more general, and
reveals the degree of statistical dependence between the two
signals. Thus, it is appropriate for cases where the imaging
modalities are very different and the relationship between
the image intensities is complex. Implementation details are
quite important for achieving good results with this mea-
sure. We have based this work on the efficient formulation
of (Thévenaz and Unser 2000). This method accumulates
a Parzen-windowed model of the joint distribution, and its
derivatives over all the pixels.
This method of computing MI works as follows. Let
bfk , where k = 1 . . .K , be a set of K bins of width df
for the intensity values in the reference image starting at
bf0 = minx If (x). Similarly, let bml be the bins for the inten-
sity values in the moving image, where l = 1 . . .L, the bins
begin at bm0 = minx Im(x) and have width dm. Then the un-
normalized joint distribution is an array of size K × L. The
entry Pkl is equal to
Pkl(φm) =
N
∑
i=1
β0
(
k−
Ifi − bf0
df
)
β3
(
l−
Imi − bm0
dm
)
(13)
where β0 and β3 are 0th and 3rd order B-spline Parzen win-
dows respectively, and N is the number of pixels. The nor-
malization factor of Pkl is simply the number of pixels coun-
ted, N . This is true because the B-spline Parzen windows
satisfy the partition of unity constraint. That is, their contri-
butions to the table for each pixel will always sum to one,
regardless of the pixel value (see Thévenaz and Unser 2000,
p. 2085, for details). The mutual information can then be
computed using the usual formula:
DMI (φ) =
K
∑
k=1
L
∑
l=1
Pkl(φ)
N
log
NPkl(φ)
∑
k′ Pk′l(φ) ·
∑
l′ Pkl′(φ)
.
Unfortunately, this formulation is asymmetrical. Because
a 3rd order B-spline is differentiable, the gradient of the joint
histogram with respect to φm, ∇φPkl(φ), can be easily com-
puted. The derivative with respect to φf however, involves
the derivative of a zero-th order B-spline, which is unde-
fined.
To overcome this issue, we modify the technique to com-
pute this gradient by reversing the roles of the reference and
the fixed image for the gradient part of the calculation only.
That is, we accumulate an additional joint distribution,
Qkl(φ) =
N
∑
i=1
β0
(
k −
Imi − bm0
dm
)
β3
(
l −
Ifi − bf0
df
)
, (14)
and accumulate its derivatives in terms of φf instead of φm.
Then DMI is computed in the original way, but its gradient
and Hessian are computed by substituting Qkl for Pkl , and
swapping the roles of the summation indexes k and l. This
has the effect of increasing the computational load some-
what to compute the derivatives with respect to φf . We can
still obtain computational savings, but they are not as large
as they would be in a symmetric formulation of MI.
We found that in practice this formulation could expe-
rience some “jittering” of the calculated derivatives due to
small numbers of samples in some of the bins. To overcome
this problem, as suggested for other reasons in Toews et al.
(2005), we preseeded the bins of the joint histogram with a
uniform number of samples. For all MI calculations used for
this article, the joint histogram was preseeded with 10 sam-
ples in all bins. For brevity, we have placed the rest of the
details in Appendix B.
4.2 Optimization
We have performed our registrations using three optimiz-
ers: a gradient descent, a Newton-Raphson method, and a
quasi-Newton method, which are broadly representative of
the types of optimization algorithms being used in image
registration currently.
Gradient descent methods, despite having a number of
known problems, are still widely used due to their simplic-
ity. All such methods operate by taking steps in the direction
of the gradient, and the methods are mainly differentiated
by how the sizes of these steps are chosen. We have used a
trust-region approach to adapting the step size in our gradi-
ent descent method.
Briefly, trust-region optimization methods work by creat-
ing a model of the objective function at each step. The model
is considered to be good (i.e. trusted) inside the trust re-
gion. At each step, the model is minimized, within the region
where it is considered valid. This inner minimization is re-
ferred to as the trust-region subproblem. If the model agrees
well with the objective function the trust-region may be ex-
panded. If not, it will be contracted. Liu and Chen (2004)
is an appropriate measure to use when the corresponding in-
tensities in each image are expected to be identical. Normal-
ized correlation is appropriate for comparing images where
there is a linear relationship between the intensities in the
fixed and moving images, such as photographs under differ-
ent lighting conditions or satellite images in different spec-
tral bands. Both of these measures are widely known and
their derivatives are straightforward. Most importantly, the
roles of the fixed and moving images are totally symmetric;
thus, obtaining ∂D∂φf and
∂2D
∂φ2f
is easily done by reversing the
roles of If and Im in the expressions for each measure. We
have therefore chosen to place the full expressions, and the
derivation of their gradients and approximation Hessians, in
Appendix B.
4.1.1 Mutual Information
While NCC relies on the image intensities being correlated
when they are in proper alignment, MI is more general, and
reveals the degree of statistical dependence between the two
signals. Thus, it is appropriate for cases where the imaging
modalities are very different and the relationship between
the image intensities is complex. Implementation details are
quite important for achieving good results with this mea-
sure. We have based this work on the efficient formulation
of (Thévenaz and Unser 2000). This method accumulates
a Parzen-windowed model of the joint distribution, and its
derivatives over all the pixels.
This method of computing MI works as follows. Let
bfk , where k = 1 . . .K , be a set of K bins of width df
for the intensity values in the reference image starting at
bf0 = minx If (x). Similarly, let bml be the bins for the inten-
sity values in the moving image, where l = 1 . . .L, the bins
begin at bm0 = minx Im(x) and have width dm. Then the un-
normalized joint distribution is an array of size K × L. The
entry Pkl is equal to
Pkl(φm) =
N
∑
i=1
β0
(
k−
Ifi − bf0
df
)
β3
(
l−
Imi − bm0
dm
)
(13)
where β0 and β3 are 0th and 3rd order B-spline Parzen win-
dows respectively, and N is the number of pixels. The nor-
malization factor of Pkl is simply the number of pixels coun-
ted, N . This is true because the B-spline Parzen windows
satisfy the partition of unity constraint. That is, their contri-
butions to the table for each pixel will always sum to one,
regardless of the pixel value (see Thévenaz and Unser 2000,
p. 2085, for details). The mutual information can then be
computed using the usual formula:
DMI (φ) =
K
∑
k=1
L
∑
l=1
Pkl(φ)
N
log
NPkl(φ)
∑
k′ Pk′l(φ) ·
∑
l′ Pkl′(φ)
.
Unfortunately, this formulation is asymmetrical. Because
a 3rd order B-spline is differentiable, the gradient of the joint
histogram with respect to φm, ∇φPkl(φ), can be easily com-
puted. The derivative with respect to φf however, involves
the derivative of a zero-th order B-spline, which is unde-
fined.
To overcome this issue, we modify the technique to com-
pute this gradient by reversing the roles of the reference and
the fixed image for the gradient part of the calculation only.
That is, we accumulate an additional joint distribution,
Qkl(φ) =
N
∑
i=1
β0
(
k −
Imi − bm0
dm
)
β3
(
l −
Ifi − bf0
df
)
, (14)
and accumulate its derivatives in terms of φf instead of φm.
Then DMI is computed in the original way, but its gradient
and Hessian are computed by substituting Qkl for Pkl , and
swapping the roles of the summation indexes k and l. This
has the effect of increasing the computational load some-
what to compute the derivatives with respect to φf . We can
still obtain computational savings, but they are not as large
as they would be in a symmetric formulation of MI.
We found that in practice this formulation could expe-
rience some “jittering” of the calculated derivatives due to
small numbers of samples in some of the bins. To overcome
this problem, as suggested for other reasons in Toews et al.
(2005), we preseeded the bins of the joint histogram with a
uniform number of samples. For all MI calculations used for
this article, the joint histogram was preseeded with 10 sam-
ples in all bins. For brevity, we have placed the rest of the
details in Appendix B.
4.2 Optimization
We have performed our registrations using three optimiz-
ers: a gradient descent, a Newton-Raphson method, and a
quasi-Newton method, which are broadly representative of
the types of optimization algorithms being used in image
registration currently.
Gradient descent methods, despite having a number of
known problems, are still widely used due to their simplic-
ity. All such methods operate by taking steps in the direction
of the gradient, and the methods are mainly differentiated
by how the sizes of these steps are chosen. We have used a
trust-region approach to adapting the step size in our gradi-
ent descent method.
Briefly, trust-region optimization methods work by creat-
ing a model of the objective function at each step. The model
is considered to be good (i.e. trusted) inside the trust re-
gion. At each step, the model is minimized, within the region
where it is considered valid. This inner minimization is re-
ferred to as the trust-region subproblem. If the model agrees
well with the objective function the trust-region may be ex-
panded. If not, it will be contracted. Liu and Chen (2004)
Page 9
Int J Comput Vis (2010) 87: 191–212 199
Algorithm 6 Trust-Region Gradient Descent optimization
Set start position φ0; iteration counter n = 0; trust-region
radius R; scaling matrix M
Compute ∇φD(φ(0)) and D(φ(0))
Repeat
Compute φ(n+1) at intersection of negative scaled gra-
dient and trust-region boundary.
Compute Expected improvement E(i) = (φ(n+1) −
φ(n)) · ∇φD(φ(n)) − D(φ(n))
Compute ∇φD(φ(n+1)) and D(φ(n+1))
Compute ρ, the ratio of real to expected improvement.
ρ =
(D(φ(n+1))−D(φ(n)))
E(n)
If ρ < c0 then
{This is an unexpectedly poor result, so the model
is wrong. We reject the step and shrink the trust re-
gion.}
R = γ0R
Else if ρ < c1 then
{The improvement is acceptable relative to the
model, so we accept the step}
n = n + 1
Else
{This improvement is good or excellent relative to
the model. We accept the step, and expand the trust
region}
n = n + 1; R = γ1R
End if
Until convergence criteria reached
The constants c0, c1, γ0 and γ1 are 0.001, 0.1, 0.1 and 2 respec-
tively. These values were chosen based on the recommendations in
(Conn et al. 2000).
have made a good argument for the use of trust-region meth-
ods in image registration. In the case of gradient descent,
solving the trust-region subproblem is simple. The solution
is the intersection of the descent direction with the trust re-
gion boundary, which makes the method roughly equivalent
to gradient descent with adaptive step sizes. A detailed de-
scription is provided in Algorithm 6.
Quasi-Newton methods are widely used when the Hess-
ian is unavailable. They work by building up an approxi-
mation to the Hessian from the sequence of function gra-
dients that they receive. Of these methods, the Broyden-
Fletcher-Goldfarb-Shanno (BFGS) update is generally con-
sidered to be superior (Venkataraman 2001). We used the
limited memory BFGS implementation described by Byrd
et al. (1995). Briefly, the approach approximates the inverse
Hessian from a limited length sequence of the gradients that
it has received. At each direction, it selects a search direction
based on this estimate, and performs a line search in that di-
rection. Complete details can be found in Byrd et al. (1995).
The original IC algorithm does not support the Quasi-
Algorithm 7 Trust-Region Newton-Raphson optimization
Set start position φ0; iteration counter n = 0; trust-region
radius R; scaling matrix M
Compute HφD(φ(0)), ∇φD(φ(0)) and D(φ(0))
Repeat
Compute φ(n+1) using Steihaug-Toint method.
Compute Expected improvement E(n) = (φ(n+1) −
φ(n)) · ∇φD(φ(n)) − D(φ(n))
Compute HφD(φ(0)), ∇φD(φ(n+1)) and D(φ(n+1))
Compute ρ, the ratio of real to expected improvement.
ρ =
(D(φ(n+1))−D(φ(n)))
E(n)
If ρ < c0 then
{This is an unexpectedly poor result, so the model
is wrong. We reject the step and shrink the trust re-
gion.}
R = γ0R
Else if ρ < c1 or step is within the trust region then
{The improvement is acceptable relative to the
model, so we accept the step}
n = n + 1
Else
{This improvement is good or excellent relative to
the model and we are at the edge of the trust region.
We accept the step, and expand the trust region}
n = n + 1; R = γ1R
End if
Until convergence criteria reached
The constants c0, c1, γ0 and γ1 are 0.001, 0.1, 0.1 and 2 respec-
tively. These values were chosen based on the recommendations in
(Conn et al. 2000). The fact that the expected improvement is not
calculated with a quadratic model is intentional, as our experience
shows that this seems to improve the efficiency significantly.
Newton method, and so we did not perform Quasi-Newton
experiments with the original IC method in this paper.
Newton-Raphson optimization methods have been most
widely used with the inverse-compositional approach, in-
cluding Gauss-Newton methods in Baker and Matthews
(2004), and Levenberg-Marquardt methods in Dowson and
Bowden (2008). Trust-region Newton-Raphson methods are
closely related to the Levenberg-Marquardt method, but are
somewhat more recent and general (Conn et al. 2000). For
the second order model, solving the trust-region subprob-
lem is more complex. We have used the Steihaug-Toint al-
gorithm (Conn et al. 2000, p. 205) for this purpose. Al-
gorithm 7 describes the trust-region Newton-Raphson ap-
proach in detail. Details of the Steihaug-Toint solution can
be found in (Conn et al. 2000).
In the experiments that follow, we have applied our gen-
eralized method and the original method to a variety of data-
sets both synthetic and real. Not all similarity measures or
optimizers are appropriate for every context, but each of the
Algorithm 6 Trust-Region Gradient Descent optimization
Set start position φ0; iteration counter n = 0; trust-region
radius R; scaling matrix M
Compute ∇φD(φ(0)) and D(φ(0))
Repeat
Compute φ(n+1) at intersection of negative scaled gra-
dient and trust-region boundary.
Compute Expected improvement E(i) = (φ(n+1) −
φ(n)) · ∇φD(φ(n)) − D(φ(n))
Compute ∇φD(φ(n+1)) and D(φ(n+1))
Compute ρ, the ratio of real to expected improvement.
ρ =
(D(φ(n+1))−D(φ(n)))
E(n)
If ρ < c0 then
{This is an unexpectedly poor result, so the model
is wrong. We reject the step and shrink the trust re-
gion.}
R = γ0R
Else if ρ < c1 then
{The improvement is acceptable relative to the
model, so we accept the step}
n = n + 1
Else
{This improvement is good or excellent relative to
the model. We accept the step, and expand the trust
region}
n = n + 1; R = γ1R
End if
Until convergence criteria reached
The constants c0, c1, γ0 and γ1 are 0.001, 0.1, 0.1 and 2 respec-
tively. These values were chosen based on the recommendations in
(Conn et al. 2000).
have made a good argument for the use of trust-region meth-
ods in image registration. In the case of gradient descent,
solving the trust-region subproblem is simple. The solution
is the intersection of the descent direction with the trust re-
gion boundary, which makes the method roughly equivalent
to gradient descent with adaptive step sizes. A detailed de-
scription is provided in Algorithm 6.
Quasi-Newton methods are widely used when the Hess-
ian is unavailable. They work by building up an approxi-
mation to the Hessian from the sequence of function gra-
dients that they receive. Of these methods, the Broyden-
Fletcher-Goldfarb-Shanno (BFGS) update is generally con-
sidered to be superior (Venkataraman 2001). We used the
limited memory BFGS implementation described by Byrd
et al. (1995). Briefly, the approach approximates the inverse
Hessian from a limited length sequence of the gradients that
it has received. At each direction, it selects a search direction
based on this estimate, and performs a line search in that di-
rection. Complete details can be found in Byrd et al. (1995).
The original IC algorithm does not support the Quasi-
Algorithm 7 Trust-Region Newton-Raphson optimization
Set start position φ0; iteration counter n = 0; trust-region
radius R; scaling matrix M
Compute HφD(φ(0)), ∇φD(φ(0)) and D(φ(0))
Repeat
Compute φ(n+1) using Steihaug-Toint method.
Compute Expected improvement E(n) = (φ(n+1) −
φ(n)) · ∇φD(φ(n)) − D(φ(n))
Compute HφD(φ(0)), ∇φD(φ(n+1)) and D(φ(n+1))
Compute ρ, the ratio of real to expected improvement.
ρ =
(D(φ(n+1))−D(φ(n)))
E(n)
If ρ < c0 then
{This is an unexpectedly poor result, so the model
is wrong. We reject the step and shrink the trust re-
gion.}
R = γ0R
Else if ρ < c1 or step is within the trust region then
{The improvement is acceptable relative to the
model, so we accept the step}
n = n + 1
Else
{This improvement is good or excellent relative to
the model and we are at the edge of the trust region.
We accept the step, and expand the trust region}
n = n + 1; R = γ1R
End if
Until convergence criteria reached
The constants c0, c1, γ0 and γ1 are 0.001, 0.1, 0.1 and 2 respec-
tively. These values were chosen based on the recommendations in
(Conn et al. 2000). The fact that the expected improvement is not
calculated with a quadratic model is intentional, as our experience
shows that this seems to improve the efficiency significantly.
Newton method, and so we did not perform Quasi-Newton
experiments with the original IC method in this paper.
Newton-Raphson optimization methods have been most
widely used with the inverse-compositional approach, in-
cluding Gauss-Newton methods in Baker and Matthews
(2004), and Levenberg-Marquardt methods in Dowson and
Bowden (2008). Trust-region Newton-Raphson methods are
closely related to the Levenberg-Marquardt method, but are
somewhat more recent and general (Conn et al. 2000). For
the second order model, solving the trust-region subprob-
lem is more complex. We have used the Steihaug-Toint al-
gorithm (Conn et al. 2000, p. 205) for this purpose. Al-
gorithm 7 describes the trust-region Newton-Raphson ap-
proach in detail. Details of the Steihaug-Toint solution can
be found in (Conn et al. 2000).
In the experiments that follow, we have applied our gen-
eralized method and the original method to a variety of data-
sets both synthetic and real. Not all similarity measures or
optimizers are appropriate for every context, but each of the
Page 10
200 Int J Comput Vis (2010) 87: 191–212
similarity measures has been tested with each of the opti-
mizers in at least one context.
5 Experiments
The algorithms were tested by running numerous image reg-
istrations from different starting positions. Each algorithm
was applied using a multiresolution approach, with the num-
ber of levels dependent on the size of the image. Scaling of
the optimization parameters is important for any optimiza-
tion algorithm. We have scaled the parameters by the square
root of the diagonal of the Gauss-Newton MSD Hessian of
the fixed image. This is the optimal diagonal preconditioner
for least squares problems (Dennis et al. 1993), and yields
good results in our experience. Each optimizer used similar
stopping criteria, specifically, that the minimal scaled step
size should be 0.005 units, and the minimal scaled gradient
magnitude should be 0.05 units. A large bound (400) was
placed on the number of iterations and reaching that bound
was considered a failure. This was an extremely rare occur-
rence.
The performance of the algorithms was measured by re-
cording the time required, the positional accuracy, the num-
ber of function evaluations required and the failure rate. All
runs were performed on a 2.2 GHz Sun Opteron 875 using
a single threaded implementation. Run time is the time re-
quired to initialize and run the optimization, but does not
include the time to load the data from disk. Positional ac-
curacy was measured in terms of mean Target Registration
Error (mTRE), proposed as a measure for registration accu-
racy in (van de Kraats et al. 2005). This is calculated using a
10 × 10 grid (10 × 10 × 10 in the 3D case) of points evenly
spaced in the image extent. These are projected both through
the true, and the computed transform, and the mean distance
between final point positions is calculated. A particular run
of the algorithm was considered to have failed when either
the algorithm aborted and reported a failure, or when the
final position was more than 5 pixels away from the true po-
sition in mTRE.
Statistical significance testing of the results was perform-
ed to confirm that the differences detected are meaningful.
Since each algorithm under comparison was run on exactly
the same input data, pairwise comparisons were possible.
A pairwise test addresses the question of whether the dif-
ference between values for corresponding runs is greater or
less than zero. In general, pairwise testing is much more
sensitive than globally testing the results. Pairwise t-tests
(Sheskin 2000, pp. 433–461) were used for the numerical
results (time, mTRE, number of evaluations) and the Mc-
Nemar test (Sheskin 2000, pp. 491–508) was used for the
algorithm failure rate. The fact that multiple comparisons
were done between algorithms must be accounted for in the
Table 1 Percentage of registration time spent computing the per-pixel
derivative. This is a upper bound on the speedup achievable using the
inverse compositional method for registration of an image pair (Ap-
plications which require multiple registrations to the same reference
image, such as tracking, may have a greater overall gain)
Measure MSD NCC MI
Percent time 38% 36% 29%
test results. The t-tests were corrected for multiple compar-
isons using the Tukey method (Sheskin 2000, pp. 534–535),
and the McNemar tests were corrected using the Bonferroni-
Dunn method (Sheskin 2000, pp. 531–534). The correction
for multiple comparisons reduces the power of the test—
something that could help our argument in certain cases. To
avoid artificially bolstering our claims, if a particular com-
parison indicated a difference before the multiple compari-
son correction, this is indicated as an ambiguous result.
It is worth considering just how much speed up could
possibly be achieved by this method. Each evaluation of the
cost function also requires the calculation of its derivative
and possibly its Hessian. The calculation of the derivative
requires the computation of the derivative at each pixel, and
the summation of these values. The IC method speeds up
the computation of the derivative at each pixel (the per-pixel
derivative). For Hessian based methods, the IC method al-
lows caching of the Hessian, so there are some further sav-
ings.
There are at least two major ways to implement the calcu-
lation of ∂D∂φm in the forwards additive approach. One, used
by (Baker and Matthews 2004), warps the image and com-
putes the derivative on this warped image at each iteration.
The other approach is to compute the derivative of the image
once, cache it, and warp this derivative at each evaluation.
The second approach is much faster, and is what we have
used here for comparison.
To determine just how much speed up could possibly be
achieved under the best possible conditions, we used the
Callgrind profiler (Weidendorfer et al. 2004) to determine
approximately what percentage of the computation is spent
on evaluating the per-pixel derivative. This forms an up-
per bound on the speedup achievable by the IC method for
gradient based methods. This upper bound cannot ever be
achieved in practice because some time must be spent com-
puting the derivative once, and time must still be spent to
retrieve the data from memory. Nevertheless, it provides a
guide for evaluating the effectiveness of an IC implemen-
tation. The results of the timings are shown in Table 1 and
indicate that we can hope to speed up the time by roughly
a third, with the improvement for the MI metric likely to be
a little less than the others. Some additional improvement
may be possible for the Hessian based methods, as there is
no longer any need even to sum up the computed derivatives
once the Hessian is cached.
similarity measures has been tested with each of the opti-
mizers in at least one context.
5 Experiments
The algorithms were tested by running numerous image reg-
istrations from different starting positions. Each algorithm
was applied using a multiresolution approach, with the num-
ber of levels dependent on the size of the image. Scaling of
the optimization parameters is important for any optimiza-
tion algorithm. We have scaled the parameters by the square
root of the diagonal of the Gauss-Newton MSD Hessian of
the fixed image. This is the optimal diagonal preconditioner
for least squares problems (Dennis et al. 1993), and yields
good results in our experience. Each optimizer used similar
stopping criteria, specifically, that the minimal scaled step
size should be 0.005 units, and the minimal scaled gradient
magnitude should be 0.05 units. A large bound (400) was
placed on the number of iterations and reaching that bound
was considered a failure. This was an extremely rare occur-
rence.
The performance of the algorithms was measured by re-
cording the time required, the positional accuracy, the num-
ber of function evaluations required and the failure rate. All
runs were performed on a 2.2 GHz Sun Opteron 875 using
a single threaded implementation. Run time is the time re-
quired to initialize and run the optimization, but does not
include the time to load the data from disk. Positional ac-
curacy was measured in terms of mean Target Registration
Error (mTRE), proposed as a measure for registration accu-
racy in (van de Kraats et al. 2005). This is calculated using a
10 × 10 grid (10 × 10 × 10 in the 3D case) of points evenly
spaced in the image extent. These are projected both through
the true, and the computed transform, and the mean distance
between final point positions is calculated. A particular run
of the algorithm was considered to have failed when either
the algorithm aborted and reported a failure, or when the
final position was more than 5 pixels away from the true po-
sition in mTRE.
Statistical significance testing of the results was perform-
ed to confirm that the differences detected are meaningful.
Since each algorithm under comparison was run on exactly
the same input data, pairwise comparisons were possible.
A pairwise test addresses the question of whether the dif-
ference between values for corresponding runs is greater or
less than zero. In general, pairwise testing is much more
sensitive than globally testing the results. Pairwise t-tests
(Sheskin 2000, pp. 433–461) were used for the numerical
results (time, mTRE, number of evaluations) and the Mc-
Nemar test (Sheskin 2000, pp. 491–508) was used for the
algorithm failure rate. The fact that multiple comparisons
were done between algorithms must be accounted for in the
Table 1 Percentage of registration time spent computing the per-pixel
derivative. This is a upper bound on the speedup achievable using the
inverse compositional method for registration of an image pair (Ap-
plications which require multiple registrations to the same reference
image, such as tracking, may have a greater overall gain)
Measure MSD NCC MI
Percent time 38% 36% 29%
test results. The t-tests were corrected for multiple compar-
isons using the Tukey method (Sheskin 2000, pp. 534–535),
and the McNemar tests were corrected using the Bonferroni-
Dunn method (Sheskin 2000, pp. 531–534). The correction
for multiple comparisons reduces the power of the test—
something that could help our argument in certain cases. To
avoid artificially bolstering our claims, if a particular com-
parison indicated a difference before the multiple compari-
son correction, this is indicated as an ambiguous result.
It is worth considering just how much speed up could
possibly be achieved by this method. Each evaluation of the
cost function also requires the calculation of its derivative
and possibly its Hessian. The calculation of the derivative
requires the computation of the derivative at each pixel, and
the summation of these values. The IC method speeds up
the computation of the derivative at each pixel (the per-pixel
derivative). For Hessian based methods, the IC method al-
lows caching of the Hessian, so there are some further sav-
ings.
There are at least two major ways to implement the calcu-
lation of ∂D∂φm in the forwards additive approach. One, used
by (Baker and Matthews 2004), warps the image and com-
putes the derivative on this warped image at each iteration.
The other approach is to compute the derivative of the image
once, cache it, and warp this derivative at each evaluation.
The second approach is much faster, and is what we have
used here for comparison.
To determine just how much speed up could possibly be
achieved under the best possible conditions, we used the
Callgrind profiler (Weidendorfer et al. 2004) to determine
approximately what percentage of the computation is spent
on evaluating the per-pixel derivative. This forms an up-
per bound on the speedup achievable by the IC method for
gradient based methods. This upper bound cannot ever be
achieved in practice because some time must be spent com-
puting the derivative once, and time must still be spent to
retrieve the data from memory. Nevertheless, it provides a
guide for evaluating the effectiveness of an IC implemen-
tation. The results of the timings are shown in Table 1 and
indicate that we can hope to speed up the time by roughly
a third, with the improvement for the MI metric likely to be
a little less than the others. Some additional improvement
may be possible for the Hessian based methods, as there is
no longer any need even to sum up the computed derivatives
once the Hessian is cached.
Page 11
Int J Comput Vis (2010) 87: 191–212 201
We present four sets of experiments with these algo-
rithms. One of the only ways to have a true gold standard for
accuracy is to warp images by a known transformation and
try to recover it. Experiments 1 and 2 utilize images synthet-
ically warped by known, random transforms. Experiment 1
consists of 2D images, warped by a homography, and Ex-
periment 2 consists of 3D image volumes warped by a rigid
3D transformation. We then recover these transforms with
our algorithms and evaluate the accuracy. While syntheti-
cally warped data gives us insight into the true accuracy of
the algorithms, working with real data often provides more
insight about the reliability of an image registration method.
Experiments 3 and 4 present experiments on real cases sim-
ilar to the simulated cases in experiments 1 and 2.
5.1 Experiment Set 1: Synthetic Homographies
When two images are taken from a perspective camera that
is rotating about its optical center, or when a moving per-
spective camera takes multiple images of a planar scene,
those images are related by a homography. Not surprisingly,
the homography is one of the most commonly used trans-
forms in computer vision applications. Ten synthetic homo-
graphies were generated and used to warp the image shown
in Fig. 4(a). The central part of the image was then cropped
out and used for the registration tests. This cropping process
avoided having the edge of the image in the region to be reg-
istered, which could have skewed the results. The starting
image, and an example of the warped input data are shown
in Fig. 4. All work was done with greyscale versions of the
images.
For the first part of this experiment, each warped image
was registered to the original from 80 starting positions at
distances from 2.6 to 81 pixels in mTRE. All registrations
were attempted in both directions, that is, both choices for
which image was fixed and which was moving were tried.
Four multiresolution levels were used, and, for speed rea-
sons, the cost functions were evaluated using a randomly
chosen 30% of the pixels. Each of the three cost functions
under investigation was used to register the data. The results
are shown in Fig. 5.
The results shown in Fig. 5(a) show a clear ranking in
terms of time required to complete with ESM being the
Fig. 4 Example of input data for Experiment 1
slowest, followed by the classical algorithm, and an approxi-
mate tie between our generalized IC method and the original
IC method. This implies that our approach is generalizing
the original IC method, without losing any of its speed ad-
vantages. The one exception is that the IC BFGS method for
MI does not show a speed advantage. Upon examining the
number of evaluations, we see that this is because it has re-
quired more evaluations to complete. It is our opinion that
this effect occurs because of the way we formulated the gra-
dient. Due to the asymmetry of the problem, our IC gradient
is slightly less accurate than the forwards gradient. The line
search used in the BFGS algorithm forms a cubic model of
the function using the gradient, and is sensitive to this. This
does not seem to affect the other optimizers at all.
Fig. 5 (Color online) Experimental results for synthetic homo-
graphies, case 1: starting normal distances from the true posi-
tion. The optimization algorithms are Gradient Descent (GD), Broy-
den-Fletcher-Goldfarb-Shanno (BFGS) and Newton-Raphson (N). For
each algorithm, bars are shown for the following methods from left to
right: 1: ESM (Red) 2: Classical Forwards Additive (Black) 3: Gen-
eralized IC (Blue) and 4: Original IC (Green). Dark bars are statisti-
cally significantly different from classical, light bars are not signifi-
cantly different and outline bars are ambiguous (see text). No original
IC method was implemented for the BFGS optimizer
We present four sets of experiments with these algo-
rithms. One of the only ways to have a true gold standard for
accuracy is to warp images by a known transformation and
try to recover it. Experiments 1 and 2 utilize images synthet-
ically warped by known, random transforms. Experiment 1
consists of 2D images, warped by a homography, and Ex-
periment 2 consists of 3D image volumes warped by a rigid
3D transformation. We then recover these transforms with
our algorithms and evaluate the accuracy. While syntheti-
cally warped data gives us insight into the true accuracy of
the algorithms, working with real data often provides more
insight about the reliability of an image registration method.
Experiments 3 and 4 present experiments on real cases sim-
ilar to the simulated cases in experiments 1 and 2.
5.1 Experiment Set 1: Synthetic Homographies
When two images are taken from a perspective camera that
is rotating about its optical center, or when a moving per-
spective camera takes multiple images of a planar scene,
those images are related by a homography. Not surprisingly,
the homography is one of the most commonly used trans-
forms in computer vision applications. Ten synthetic homo-
graphies were generated and used to warp the image shown
in Fig. 4(a). The central part of the image was then cropped
out and used for the registration tests. This cropping process
avoided having the edge of the image in the region to be reg-
istered, which could have skewed the results. The starting
image, and an example of the warped input data are shown
in Fig. 4. All work was done with greyscale versions of the
images.
For the first part of this experiment, each warped image
was registered to the original from 80 starting positions at
distances from 2.6 to 81 pixels in mTRE. All registrations
were attempted in both directions, that is, both choices for
which image was fixed and which was moving were tried.
Four multiresolution levels were used, and, for speed rea-
sons, the cost functions were evaluated using a randomly
chosen 30% of the pixels. Each of the three cost functions
under investigation was used to register the data. The results
are shown in Fig. 5.
The results shown in Fig. 5(a) show a clear ranking in
terms of time required to complete with ESM being the
Fig. 4 Example of input data for Experiment 1
slowest, followed by the classical algorithm, and an approxi-
mate tie between our generalized IC method and the original
IC method. This implies that our approach is generalizing
the original IC method, without losing any of its speed ad-
vantages. The one exception is that the IC BFGS method for
MI does not show a speed advantage. Upon examining the
number of evaluations, we see that this is because it has re-
quired more evaluations to complete. It is our opinion that
this effect occurs because of the way we formulated the gra-
dient. Due to the asymmetry of the problem, our IC gradient
is slightly less accurate than the forwards gradient. The line
search used in the BFGS algorithm forms a cubic model of
the function using the gradient, and is sensitive to this. This
does not seem to affect the other optimizers at all.
Fig. 5 (Color online) Experimental results for synthetic homo-
graphies, case 1: starting normal distances from the true posi-
tion. The optimization algorithms are Gradient Descent (GD), Broy-
den-Fletcher-Goldfarb-Shanno (BFGS) and Newton-Raphson (N). For
each algorithm, bars are shown for the following methods from left to
right: 1: ESM (Red) 2: Classical Forwards Additive (Black) 3: Gen-
eralized IC (Blue) and 4: Original IC (Green). Dark bars are statisti-
cally significantly different from classical, light bars are not signifi-
cantly different and outline bars are ambiguous (see text). No original
IC method was implemented for the BFGS optimizer
Page 12
202 Int J Comput Vis (2010) 87: 191–212
Fig. 6 (Color online) Experimental results for synthetic ho-
mographies, case 2: starting far from the correct transform.
The optimization algorithms are Gradient Descent (GD), Broy-
den-Fletcher-Goldfarb-Shanno (BFGS) and Newton-Raphson (N). For
each algorithm, bars are shown for the following methods from left to
right: 1: ESM (Red) 2: Classical Forwards Additive (Black) 3: Gen-
eralized IC (Blue) and 4: Original IC (Green). Dark bars are statisti-
cally significantly different from classical, light bars are not signifi-
cantly different and outline bars are ambiguous (see text). No original
IC method was implemented for the BFGS optimizer
Other than the issues with the BFGS method, the differ-
ences in accuracy (as measured by mTRE) and number of
evaluations are very small. Nevertheless, several of the bars
are colored as statistically significant. Upon investigation of
the data, this occurs because the IC method has a tendency
to stop slightly early. Because the tendency is consistently
biased one way, even if small, the pairwise statistical tests
can pick it up. Its effect on the overall results is very minor.
While the ESM method does not have any speed advan-
tage for this problem configuration, it does show a reliability
advantage over the classical method. Direct image registra-
tion using this approach is dependent on the capture radius
of the optimizer being used. The capture radius is the dis-
tance (in parameter space) from the true position that the
optimizer can be started and still be expected to converge.
The results of this experiment shown in Fig. 5 suggest that
the ESM may improve the capture radius and thus give more
reliable results. In this experiment, all the failure rates are
small and it is difficult to draw conclusions. For this rea-
son, another experiment was required to test this conjecture.
A second set of similar experiments (Fig. 6) were performed
starting far from the correct registration. For these experi-
ments, each image was registered from 50 starting positions
with mTRE ranging from 36 to 176 pixels. Not surprisingly,
failure rates in all cases are higher, and this allows us to
see that the failure rate for the ESM method is significantly
lower than for the other methods. Intuitively, this is because
the optimizer chooses the average of the directions calcu-
lated by the IC and forwards methods. If only one of the two
methods is outside its capture radius, the other may provide
sufficient information for the optimizer to converge.
Similar to the work of Dowson and Bowden (2008), we
find in this experiment that the Inverse Compositional meth-
ods have a slightly smaller capture radius. This cannot be
only due to the inclusion or exclusion of certain pixels in the
Hessian matrix since the issue is also present in optimizers
that do not use the Hessian.
These results also show clear differences between the op-
timizers. The BFGS optimizer and Newton-Raphson opti-
mizer are generally more accurate and require fewer itera-
tions than the gradient descent optimizer on the MSD and
NCC metrics. On the MI metric, the BFGS optimizer takes
more iterations than the other two, for all methods. This is
probably due to the negative curvature in the MI function.
The MI function is saddle shaped, meaning its Hessian is
not positive definite, but the BFGS update method enforces
a constraint that the Hessian be positive definite. We con-
sider that this mismatch between the algorithm assumptions
and the cost function properties is causing this optimizer to
require more iterations.
5.2 Experiment Set 2: Synthetic 3D Rigid Transformations
Only limited work with the inverse compositional method
on 3D images has been reported (Andreopoulos and Tsot-
sos 2005; Baker et al. 2004), although its efficiency gains
are particularly relevant to this problem due to the size of
the images. In this experiment, a synthetic Magnetic Res-
onance Imaging (MRI) volume generated using the Brain-
web (Kwan et al. 1999) software was used. Five random
rigid transformations were created, and the original image
was warped by each of them. Registration was then per-
formed from 45 starting positions ranging from 2 to 43 mm
in mTRE.
As the brain image (illustrated in Fig. 7) was surrounded
by a black background, we considered that edge effects were
small, and did not perform a cropping step as it was in the
Fig. 6 (Color online) Experimental results for synthetic ho-
mographies, case 2: starting far from the correct transform.
The optimization algorithms are Gradient Descent (GD), Broy-
den-Fletcher-Goldfarb-Shanno (BFGS) and Newton-Raphson (N). For
each algorithm, bars are shown for the following methods from left to
right: 1: ESM (Red) 2: Classical Forwards Additive (Black) 3: Gen-
eralized IC (Blue) and 4: Original IC (Green). Dark bars are statisti-
cally significantly different from classical, light bars are not signifi-
cantly different and outline bars are ambiguous (see text). No original
IC method was implemented for the BFGS optimizer
Other than the issues with the BFGS method, the differ-
ences in accuracy (as measured by mTRE) and number of
evaluations are very small. Nevertheless, several of the bars
are colored as statistically significant. Upon investigation of
the data, this occurs because the IC method has a tendency
to stop slightly early. Because the tendency is consistently
biased one way, even if small, the pairwise statistical tests
can pick it up. Its effect on the overall results is very minor.
While the ESM method does not have any speed advan-
tage for this problem configuration, it does show a reliability
advantage over the classical method. Direct image registra-
tion using this approach is dependent on the capture radius
of the optimizer being used. The capture radius is the dis-
tance (in parameter space) from the true position that the
optimizer can be started and still be expected to converge.
The results of this experiment shown in Fig. 5 suggest that
the ESM may improve the capture radius and thus give more
reliable results. In this experiment, all the failure rates are
small and it is difficult to draw conclusions. For this rea-
son, another experiment was required to test this conjecture.
A second set of similar experiments (Fig. 6) were performed
starting far from the correct registration. For these experi-
ments, each image was registered from 50 starting positions
with mTRE ranging from 36 to 176 pixels. Not surprisingly,
failure rates in all cases are higher, and this allows us to
see that the failure rate for the ESM method is significantly
lower than for the other methods. Intuitively, this is because
the optimizer chooses the average of the directions calcu-
lated by the IC and forwards methods. If only one of the two
methods is outside its capture radius, the other may provide
sufficient information for the optimizer to converge.
Similar to the work of Dowson and Bowden (2008), we
find in this experiment that the Inverse Compositional meth-
ods have a slightly smaller capture radius. This cannot be
only due to the inclusion or exclusion of certain pixels in the
Hessian matrix since the issue is also present in optimizers
that do not use the Hessian.
These results also show clear differences between the op-
timizers. The BFGS optimizer and Newton-Raphson opti-
mizer are generally more accurate and require fewer itera-
tions than the gradient descent optimizer on the MSD and
NCC metrics. On the MI metric, the BFGS optimizer takes
more iterations than the other two, for all methods. This is
probably due to the negative curvature in the MI function.
The MI function is saddle shaped, meaning its Hessian is
not positive definite, but the BFGS update method enforces
a constraint that the Hessian be positive definite. We con-
sider that this mismatch between the algorithm assumptions
and the cost function properties is causing this optimizer to
require more iterations.
5.2 Experiment Set 2: Synthetic 3D Rigid Transformations
Only limited work with the inverse compositional method
on 3D images has been reported (Andreopoulos and Tsot-
sos 2005; Baker et al. 2004), although its efficiency gains
are particularly relevant to this problem due to the size of
the images. In this experiment, a synthetic Magnetic Res-
onance Imaging (MRI) volume generated using the Brain-
web (Kwan et al. 1999) software was used. Five random
rigid transformations were created, and the original image
was warped by each of them. Registration was then per-
formed from 45 starting positions ranging from 2 to 43 mm
in mTRE.
As the brain image (illustrated in Fig. 7) was surrounded
by a black background, we considered that edge effects were
small, and did not perform a cropping step as it was in the
Page 13
Int J Comput Vis (2010) 87: 191–212 203
Fig. 7 Slices through the center of the Brainweb volume used. Top
row, original volume, bottom row, warped volume
previous experiment. Three multiresolution levels were used
in the optimization, and, as in the previous experiments, a
random 30% of the image voxels were used to calculate the
measures. To save time and space, and because MSD and
MI are the most widely used measures in medical imaging
applications, we have not performed these experiments with
NCC. Also, as the results for gradient descent in the pre-
vious experiment were consistent with the other methods,
but it was slower and less accurate than BFGS or Newton-
Raphson, we omitted it from these experiments.
Overall, the results shown in Fig. 8 are entirely consis-
tent with the results of experiment 1 using synthetic homo-
graphies. The generalized IC approach is consistently faster
than the classical approach, even for the BFGS optimizer in
this case. This is interesting to demonstrate, as the mapping
φf (φm) is much more non-linear in this case than in the
previous case (see Appendix A). This means that the differ-
ence between the path explored by our generalized method
and the original IC method (see Fig. 3) is more significant in
experiment 2. That the generalized approach still achieves
speed gains here is a sign of its effectiveness.
There is a clear, and statistically significant difference in
the time required for the algorithms to run, with ESM be-
ing slower than the classical, forwards additive, method, and
with both IC methods being faster. The differences in evalu-
ations and mTRE are tiny, but in certain cases appear signif-
icant. As in the previous experiment, this is due to a slight
but consistent bias—in this case the IC method is occasion-
ally taking an additional iteration to complete, and gaining
a marginal (on the order of 0.01 pixels) amount of accu-
racy. Such a tiny difference does not have any practical im-
pact. The failure rates on this problem were very low and no
statistically significant differences between algorithms were
detected. Overall, our generalized IC method is faster than
the classical method, without increasing the failure rate.
Fig. 8 (Color online) Experimental results for synthetic rigid 3D
transforms for MSD (left) and MI (right). The optimization algorithms
are Broyden-Fletcher-Goldfarb-Shanno (BFGS) and Newton-Raphson
(N). For each algorithm, bars are shown for the following methods
from left to right: 1: ESM (Red) 2: Classical Forwards Additive (Black)
3: Generalized IC (Blue) and 4: Original IC (Green). Dark bars are sta-
tistically significantly different from classical, light bars are not signif-
icantly different and outline bars are ambiguous (see text). No original
IC method was implemented for the BFGS optimizer
5.3 Experiment Set 3: Real Homographies
The stitching together of images into larger panoramas is a
common reason to register digital photographs. In this ex-
periment, 8 image pairs from different scenes suitable for
panoramic mosaicing were used. For each pair, the camera
was rotated about its optical center before the second image
was taken. Under these conditions, the transformation be-
tween the images can be modeled by a homography. A form
of ground truth was created using the Autostitch software
(Brown and Lowe 2007), and manually checked for accu-
racy. However, this ground truth can only be considered a
silver standard—an independent comparison, but not ab-
solutely accurate.
These images present a moderately difficult registration
problem. One reason is that the amount of overlap is limited.
Furthermore, no correction for lens distortion was made;
thus, the true transformation from image to image is not a
perfect homography. Nevertheless, we find that these condi-
tions are realistic, and make the problem more interesting.
Figure 9 shows a typical image pair from the set.
For this experiment, the images were registered from 20
starting positions ranging from approximately 3 to 19 pix-
els in mTRE. All registrations were attempted in both direc-
tions, that is, both choices for which image was fixed and
which was moving were tried. Due to the significant light-
ing changes between images in any given pair, MSD was
Fig. 7 Slices through the center of the Brainweb volume used. Top
row, original volume, bottom row, warped volume
previous experiment. Three multiresolution levels were used
in the optimization, and, as in the previous experiments, a
random 30% of the image voxels were used to calculate the
measures. To save time and space, and because MSD and
MI are the most widely used measures in medical imaging
applications, we have not performed these experiments with
NCC. Also, as the results for gradient descent in the pre-
vious experiment were consistent with the other methods,
but it was slower and less accurate than BFGS or Newton-
Raphson, we omitted it from these experiments.
Overall, the results shown in Fig. 8 are entirely consis-
tent with the results of experiment 1 using synthetic homo-
graphies. The generalized IC approach is consistently faster
than the classical approach, even for the BFGS optimizer in
this case. This is interesting to demonstrate, as the mapping
φf (φm) is much more non-linear in this case than in the
previous case (see Appendix A). This means that the differ-
ence between the path explored by our generalized method
and the original IC method (see Fig. 3) is more significant in
experiment 2. That the generalized approach still achieves
speed gains here is a sign of its effectiveness.
There is a clear, and statistically significant difference in
the time required for the algorithms to run, with ESM be-
ing slower than the classical, forwards additive, method, and
with both IC methods being faster. The differences in evalu-
ations and mTRE are tiny, but in certain cases appear signif-
icant. As in the previous experiment, this is due to a slight
but consistent bias—in this case the IC method is occasion-
ally taking an additional iteration to complete, and gaining
a marginal (on the order of 0.01 pixels) amount of accu-
racy. Such a tiny difference does not have any practical im-
pact. The failure rates on this problem were very low and no
statistically significant differences between algorithms were
detected. Overall, our generalized IC method is faster than
the classical method, without increasing the failure rate.
Fig. 8 (Color online) Experimental results for synthetic rigid 3D
transforms for MSD (left) and MI (right). The optimization algorithms
are Broyden-Fletcher-Goldfarb-Shanno (BFGS) and Newton-Raphson
(N). For each algorithm, bars are shown for the following methods
from left to right: 1: ESM (Red) 2: Classical Forwards Additive (Black)
3: Generalized IC (Blue) and 4: Original IC (Green). Dark bars are sta-
tistically significantly different from classical, light bars are not signif-
icantly different and outline bars are ambiguous (see text). No original
IC method was implemented for the BFGS optimizer
5.3 Experiment Set 3: Real Homographies
The stitching together of images into larger panoramas is a
common reason to register digital photographs. In this ex-
periment, 8 image pairs from different scenes suitable for
panoramic mosaicing were used. For each pair, the camera
was rotated about its optical center before the second image
was taken. Under these conditions, the transformation be-
tween the images can be modeled by a homography. A form
of ground truth was created using the Autostitch software
(Brown and Lowe 2007), and manually checked for accu-
racy. However, this ground truth can only be considered a
silver standard—an independent comparison, but not ab-
solutely accurate.
These images present a moderately difficult registration
problem. One reason is that the amount of overlap is limited.
Furthermore, no correction for lens distortion was made;
thus, the true transformation from image to image is not a
perfect homography. Nevertheless, we find that these condi-
tions are realistic, and make the problem more interesting.
Figure 9 shows a typical image pair from the set.
For this experiment, the images were registered from 20
starting positions ranging from approximately 3 to 19 pix-
els in mTRE. All registrations were attempted in both direc-
tions, that is, both choices for which image was fixed and
which was moving were tried. Due to the significant light-
ing changes between images in any given pair, MSD was
Page 14
204 Int J Comput Vis (2010) 87: 191–212
Fig. 9 A typical image pair for Experiment 3
Fig. 10 (Color online) Experimental results for registrations of real
homographies using NCC (left) and MI (right). The optimization algo-
rithms are Gradient Descent (GD), Broyden-Fletcher-Goldfarb-Shanno
(BFGS) and Newton-Raphson (N). For each algorithm, bars are shown
for the following methods from left to right: 1: ESM (Red) 2: Classi-
cal Forwards Additive (Black) 3: Generalized IC (Blue) and 4: Origi-
nal IC (Green). Dark bars are statistically significantly different from
classical, light bars are not significantly different and outline bars are
ambiguous (see text). No original IC method was implemented for the
BFGS optimizer
an ineffective measure, and was not used for these experi-
ments. As in experiment 1, four multiresolution levels were
used, and a random 30% of the pixels were taken to compute
the measures. All registrations were performed on greyscale
versions of the images.
For brevity, we have combined the results for all images
in the graphs presented in Fig. 10. The images are of dif-
ferent sizes and characters, so these numbers must be in-
terpreted in that context. (The statistical significance com-
parisons are pairwise, however, and are therefore valid.) As
in the previous two experiments, there was little interesting
difference in the number of evaluations and the mTRE, so
we have not presented these results graphically. It is worth
noting that the mTRE was higher than in the synthetic ex-
amples in all cases, averaging about 2 pixels. This is due to
Fig. 11 Slices through MR, CT and 3DRX images of cadaver verte-
brae provided for registration testing in van de Kraats et al. (2005)
the inexact nature of the standard to which we are compar-
ing, lever effects due to large areas (i.e. sky) without overlap,
and residual lens distortions in the images.
Once again, there is a clear and statistically significant
ranking in terms of speed, with the generalized ESM ap-
proach being the slowest, followed by the classic forwards
additive approach, with the generalized IC approach being
the fastest in all cases except the BFGS optimizer for MI.
The original IC method does not perform as well in terms of
speed on these images, and in the case of Newton-Raphson
optimization of NCC is actually slower than the classical,
forwards additive, method (see Fig. 10(a)). This slow perfor-
mance is most likely related to its even poorer performance
in terms of reliability.
The generalized IC algorithm performed as well in terms
of failure rate as the classical algorithm, with only the New-
ton-Raphson, NCC case showing an ambiguous difference
in Fig. 10(b). However, the original IC method proved very
unreliable on these images, showing a striking increase in
failure rate compared to all other algorithms. We speculate
that this was due to residual lens distortions invalidating the
compositional update of the original algorithm. As shown in
Fig. 3, the compositional update traces out a curved path in
the moving image parameter space. The validity of this path
is determined by the validity of the transformation model.
Converting the derivative, on the other hand, only relies on
the validity of the tangent space local to the current point on
the manifold of transformations. These results indicate that
this is a more robust approach.
5.4 Experiment Set 4: Real Rigid 3D Transformations
In (van de Kraats et al. 2005), several 3D volumes in dif-
ferent modalities intended for use in testing of registration
algorithms are described. They registered these images to a
subpixel level of accuracy using a mutual information based
technique, which was verified against fiducial markers (see
van de Kraats et al. (2005) for details). This provides a sil-
ver standard for registration, in that the registration has been
arrived at independently of the research described here. The
datasets consist of volumetric images of cadaver vertebrae
obtained with magnetic resonance (MR), computed tomog-
raphy (CT), and 3D X-Ray (3DRX) imaging techniques.
Fig. 9 A typical image pair for Experiment 3
Fig. 10 (Color online) Experimental results for registrations of real
homographies using NCC (left) and MI (right). The optimization algo-
rithms are Gradient Descent (GD), Broyden-Fletcher-Goldfarb-Shanno
(BFGS) and Newton-Raphson (N). For each algorithm, bars are shown
for the following methods from left to right: 1: ESM (Red) 2: Classi-
cal Forwards Additive (Black) 3: Generalized IC (Blue) and 4: Origi-
nal IC (Green). Dark bars are statistically significantly different from
classical, light bars are not significantly different and outline bars are
ambiguous (see text). No original IC method was implemented for the
BFGS optimizer
an ineffective measure, and was not used for these experi-
ments. As in experiment 1, four multiresolution levels were
used, and a random 30% of the pixels were taken to compute
the measures. All registrations were performed on greyscale
versions of the images.
For brevity, we have combined the results for all images
in the graphs presented in Fig. 10. The images are of dif-
ferent sizes and characters, so these numbers must be in-
terpreted in that context. (The statistical significance com-
parisons are pairwise, however, and are therefore valid.) As
in the previous two experiments, there was little interesting
difference in the number of evaluations and the mTRE, so
we have not presented these results graphically. It is worth
noting that the mTRE was higher than in the synthetic ex-
amples in all cases, averaging about 2 pixels. This is due to
Fig. 11 Slices through MR, CT and 3DRX images of cadaver verte-
brae provided for registration testing in van de Kraats et al. (2005)
the inexact nature of the standard to which we are compar-
ing, lever effects due to large areas (i.e. sky) without overlap,
and residual lens distortions in the images.
Once again, there is a clear and statistically significant
ranking in terms of speed, with the generalized ESM ap-
proach being the slowest, followed by the classic forwards
additive approach, with the generalized IC approach being
the fastest in all cases except the BFGS optimizer for MI.
The original IC method does not perform as well in terms of
speed on these images, and in the case of Newton-Raphson
optimization of NCC is actually slower than the classical,
forwards additive, method (see Fig. 10(a)). This slow perfor-
mance is most likely related to its even poorer performance
in terms of reliability.
The generalized IC algorithm performed as well in terms
of failure rate as the classical algorithm, with only the New-
ton-Raphson, NCC case showing an ambiguous difference
in Fig. 10(b). However, the original IC method proved very
unreliable on these images, showing a striking increase in
failure rate compared to all other algorithms. We speculate
that this was due to residual lens distortions invalidating the
compositional update of the original algorithm. As shown in
Fig. 3, the compositional update traces out a curved path in
the moving image parameter space. The validity of this path
is determined by the validity of the transformation model.
Converting the derivative, on the other hand, only relies on
the validity of the tangent space local to the current point on
the manifold of transformations. These results indicate that
this is a more robust approach.
5.4 Experiment Set 4: Real Rigid 3D Transformations
In (van de Kraats et al. 2005), several 3D volumes in dif-
ferent modalities intended for use in testing of registration
algorithms are described. They registered these images to a
subpixel level of accuracy using a mutual information based
technique, which was verified against fiducial markers (see
van de Kraats et al. (2005) for details). This provides a sil-
ver standard for registration, in that the registration has been
arrived at independently of the research described here. The
datasets consist of volumetric images of cadaver vertebrae
obtained with magnetic resonance (MR), computed tomog-
raphy (CT), and 3D X-Ray (3DRX) imaging techniques.
Page 15
Int J Comput Vis (2010) 87: 191–212 205
Fig. 12 (Color online) Experimental results for real 3D rigid registra-
tions with mutual information. The optimization algorithms are Broy-
den-Fletcher-Goldfarb-Shanno (BFGS) and Newton-Raphson (N). For
each algorithm, bars are shown for the following methods from left to
right: 1: ESM (Red) 2: Classical Forwards Additive (Black) 3: Gen-
eralized IC (Blue) and 4: Original IC (Green). Dark bars are statisti-
cally significantly different from classical, light bars are not signifi-
cantly different and outline bars are ambiguous (see text). No original
IC method was implemented for the BFGS optimizer
Two triplets of registered images are provided. Figure 11
shows slices through the center of the volumes for one of
the sets of images, the other is similar. Mask images are also
provided by (van de Kraats et al. 2005) to remove the ex-
traneous surrounding area from the registration. These mask
images serve much the same function as the cropping in Ex-
periment Set 1.
The original images were provided at resolutions of 1 ×
0.75 × 0.75 mm for MR, 0.3 × 0.49 × 0.3 mm for CT and
1 × 1 × 1 mm for the 3DRX dataset. To make the dataset
sizes and resolutions comparable, we resampled the CT data
to a 0.75 × 0.75 × 0.75 mm grid.
As the images are of greatly different modalities, MI is
the only appropriate measure to use for registration. As in
experiment 2, we considered gradient descent too slow a
method to use on these large datasets, so only the BFGS
and Newton-Raphson methods were used. The images were
registered from starting positions ranging from 4 to 9 mm
in mTRE from the true registration solution. Two multires-
olution levels were used in the registration, and in this case
100% of the available pixels were used.
The combined results for all 6 image pairs are presented
in Fig. 12. Like the previous experiment, there are some
variations in image size and character, and the combined re-
sults should be interpreted with that in mind. The timing re-
sults are consistent with the previous experiments, with our
generalized IC method being fastest. However, in this case
the ESM method proves to be much slower than the other
methods. Upon examination of the data, we determined that
this is occurring because the optimizer has a tendency to
move from the first to the second multiresolution level too
early.
In this particular case, the inverse compositional and
ESM methods both show a statistically significant improve-
ment in accuracy over the classical, forwards additive ap-
proach. However, as this was not the case for other datasets,
it is our opinion that this is not a general trend, and is most
likely due to differences in the noise levels in the gradients
of this particular dataset. Finally, as in the case of the real
homography experiment, the failure rate of the original in-
verse compositional method is significantly higher on this
data.
5.5 Discussion
Overall, the results of each separate experiment are con-
sistent. In each case, the generalized inverse compositional
method runs faster than the classic, forwards additive meth-
od, and the generalized ESM method runs a bit slower. We
did not see a significant drop in the number of evaluations re-
quired when using the generalized ESM. This is most likely
related to the curvature of the path through the image do-
main, as discussed in Sect. 2.4. However, we did observe an
increase in reliability over the other algorithms when start-
ing far from the correct transform (see Fig. 6).
In all cases, the generalized IC method is equivalent or
better than the original IC method in terms of both speed
and reliability. Its major advantage, is the ease with which
new optimization methods, or existing optimization libraries
can be integrated with it. There may be a slight decrease in
reliability far from the correct transform when using either
the generalized or the original IC method, which is consis-
tent with results found by other authors (e.g. Dowson and
Bowden 2008) for the original IC method. However, in the
real homography case, the original IC method proved very
unreliable. We speculate that this is due to the transforms in
question not truly being homographies due to residual lens
distortion.
Table 2 summarizes the overall speed up obtained for
each experiment, and for all experiments combined. Overall,
our generalized method obtains speed improvements quite
near the maximum possible. The benefit is most pronounced
for the MSD metric, and somewhat less pronounced for the
NCC metric—which is consistent with the relative propor-
tion of time (Table 1) spent computing the derivative in each.
The time improvements for the MI measure are also sig-
nificant. For the Newton-Raphson method they are particu-
larly large, which is due to the MI Hessian requiring more
computation than the MSD or NCC Hessians. For the gradi-
ent based optimizers, however, the speed up is not as near the
theoretical maximum as for MSD and NCC. This is due to
three reasons. One being the different balance of computa-
tional burden—MI spends less time proportionally comput-
ing the gradient than the other two methods and thus has less
to gain from the inverse compositional approach. The sec-
ond reason is that our asymmetric implementation requires
some extra computation to compute ∇φf DMI . Finally, the
Fig. 12 (Color online) Experimental results for real 3D rigid registra-
tions with mutual information. The optimization algorithms are Broy-
den-Fletcher-Goldfarb-Shanno (BFGS) and Newton-Raphson (N). For
each algorithm, bars are shown for the following methods from left to
right: 1: ESM (Red) 2: Classical Forwards Additive (Black) 3: Gen-
eralized IC (Blue) and 4: Original IC (Green). Dark bars are statisti-
cally significantly different from classical, light bars are not signifi-
cantly different and outline bars are ambiguous (see text). No original
IC method was implemented for the BFGS optimizer
Two triplets of registered images are provided. Figure 11
shows slices through the center of the volumes for one of
the sets of images, the other is similar. Mask images are also
provided by (van de Kraats et al. 2005) to remove the ex-
traneous surrounding area from the registration. These mask
images serve much the same function as the cropping in Ex-
periment Set 1.
The original images were provided at resolutions of 1 ×
0.75 × 0.75 mm for MR, 0.3 × 0.49 × 0.3 mm for CT and
1 × 1 × 1 mm for the 3DRX dataset. To make the dataset
sizes and resolutions comparable, we resampled the CT data
to a 0.75 × 0.75 × 0.75 mm grid.
As the images are of greatly different modalities, MI is
the only appropriate measure to use for registration. As in
experiment 2, we considered gradient descent too slow a
method to use on these large datasets, so only the BFGS
and Newton-Raphson methods were used. The images were
registered from starting positions ranging from 4 to 9 mm
in mTRE from the true registration solution. Two multires-
olution levels were used in the registration, and in this case
100% of the available pixels were used.
The combined results for all 6 image pairs are presented
in Fig. 12. Like the previous experiment, there are some
variations in image size and character, and the combined re-
sults should be interpreted with that in mind. The timing re-
sults are consistent with the previous experiments, with our
generalized IC method being fastest. However, in this case
the ESM method proves to be much slower than the other
methods. Upon examination of the data, we determined that
this is occurring because the optimizer has a tendency to
move from the first to the second multiresolution level too
early.
In this particular case, the inverse compositional and
ESM methods both show a statistically significant improve-
ment in accuracy over the classical, forwards additive ap-
proach. However, as this was not the case for other datasets,
it is our opinion that this is not a general trend, and is most
likely due to differences in the noise levels in the gradients
of this particular dataset. Finally, as in the case of the real
homography experiment, the failure rate of the original in-
verse compositional method is significantly higher on this
data.
5.5 Discussion
Overall, the results of each separate experiment are con-
sistent. In each case, the generalized inverse compositional
method runs faster than the classic, forwards additive meth-
od, and the generalized ESM method runs a bit slower. We
did not see a significant drop in the number of evaluations re-
quired when using the generalized ESM. This is most likely
related to the curvature of the path through the image do-
main, as discussed in Sect. 2.4. However, we did observe an
increase in reliability over the other algorithms when start-
ing far from the correct transform (see Fig. 6).
In all cases, the generalized IC method is equivalent or
better than the original IC method in terms of both speed
and reliability. Its major advantage, is the ease with which
new optimization methods, or existing optimization libraries
can be integrated with it. There may be a slight decrease in
reliability far from the correct transform when using either
the generalized or the original IC method, which is consis-
tent with results found by other authors (e.g. Dowson and
Bowden 2008) for the original IC method. However, in the
real homography case, the original IC method proved very
unreliable. We speculate that this is due to the transforms in
question not truly being homographies due to residual lens
distortion.
Table 2 summarizes the overall speed up obtained for
each experiment, and for all experiments combined. Overall,
our generalized method obtains speed improvements quite
near the maximum possible. The benefit is most pronounced
for the MSD metric, and somewhat less pronounced for the
NCC metric—which is consistent with the relative propor-
tion of time (Table 1) spent computing the derivative in each.
The time improvements for the MI measure are also sig-
nificant. For the Newton-Raphson method they are particu-
larly large, which is due to the MI Hessian requiring more
computation than the MSD or NCC Hessians. For the gradi-
ent based optimizers, however, the speed up is not as near the
theoretical maximum as for MSD and NCC. This is due to
three reasons. One being the different balance of computa-
tional burden—MI spends less time proportionally comput-
ing the gradient than the other two methods and thus has less
to gain from the inverse compositional approach. The sec-
ond reason is that our asymmetric implementation requires
some extra computation to compute ∇φf DMI . Finally, the
Page 16
206 Int J Comput Vis (2010) 87: 191–212
Table 2 Overall speedup found when using the generalized IC
method. These show the percentage speedup ( TimeSTD−TimeICTimeSTD × 100%)
when compared to the classical (forwards additive) method. The totals
are averages of these percentages
Measure MSD NCC MI
1a: Syn. Hom. GD 29.6% 27.1% 14.2%
1a: Syn. Hom. BFGS 24% 21.5% 0.8%
1a: Syn. Hom. N 29.3% 30% 45.3%
1b: Syn. Hom. GD 29.7% 31.4% 31.5%
1b: Syn. Hom. BFGS 23.6% 23.6% −2.1%
1b: Syn. Hom. N 30.6% 37.1% 53.3%
2: Syn. R. 3D BFGS 52% – 33.3%
2: Syn. R. 3D N 17.5% – 21.2%
3: Homog. GD – 16.1% 10.7%
3: Homog. BFGS – 17.4% 1.5%
3: Homog. N – 19.2% 39.5%
4: Rigid 3D BFGS – – 19.5%
4: Rigid 3D N – – 63.6%
Total: GD 29.6% 24.9% 18.8%
Total: BFGS 33.2% 20.8% 8.8%
Total: N 25.8% 28.8% 44.6%
Total: ALL 29.5% 24.8% 24.9%
BFGS optimizer had some difficulty optimizing the IC MI
due, again, to the asymmetry. An extension of the symmetric
MILK mutual information implementation of (Dowson and
Bowden 2008) would likely provide some further efficiency
gains.
From the results, it seems clear that the advantages of the
generalized ESM method in this context, are not its speed
but its greater reliability. Table 3 shows the change in failure
rate for each experiment, and the overall total. The results
for experiment 1b, which was designed to be particularly
difficult, show significantly more reliable results when us-
ing ESM. The improvement is somewhat hidden in the over-
all results (bottom rows, Table 3) because most experiments
were feasible for all the algorithms under test, and thus the
greater reliability of generalized ESM never came into play.
6 Conclusions
In this paper, we have presented generalizations of the pop-
ular inverse compositional (IC) and efficient second order
methods (ESM) for image registration. Our generalization
is based on the insight that these methods can be viewed
as different ways of computing the derivative and Hessian
of the image difference measure. The inverse compositional
method provided us with an efficient way to compute ∇φf D
and Hφf D when what standard optimizers need is ∇φmD
and HφmD. ESM, on the other hand, computes more accu-
rate estimates of ∇φD and HφD by using both those with
Table 3 Change in failure rates
found when using the
generalized ESM method. These
show the change in failure rate
percentage (FailESM − FailSTD)
when compared to the classical
(forwards additive) method. The
percentage change of the failure
rate ( FailESM−FailSTDFailSTD × 100%) is
shown in parentheses. The totals
are computed by first averaging
the failure rates giving each
experiment equal weight, and
then computing the change
Measure MSD NCC MI
1a: Syn. Hom. GD 0.1% (11.1%) −0.2% (−11.1%) −0.3% (−3.9%)
1a: Syn. Hom. BFGS 0.0% (0.0%) 0.0% (0.0%) −0.2% (−100.0%)
1a: Syn. Hom. N −0.2% (−60.0%) −0.1% (−20.0%) 0.5% (200.0%)
1b: Syn. Hom. GD −1.0% (−14.3%) −5.4% (−36.0%) −5.6% (−8.0%)
1b: Syn. Hom. BFGS −1.6% (−25.8%) −4.2% (−42.4%) −12.1% (−45.3%)
1b: Syn. Hom. N −1.8% (−15.4%) −2.6% (−15.3%) −5.0% (−14.1%)
2: Syn. R. 3D BFGS 1.3% (100.0%) − (−) 0.0% (0.0%)
2: Syn. R. 3D N 0.0% (0.0%) − (−) 0.0% (0.0%)
3: Homog. GD − (−) −2.8% (−21.4%) −1.9% (−9.7%)
3: Homog. BFGS − (−) −4.4% (−28.6%) 3.1% (9.1%)
3: Homog. N − (−) −1.6% (−10.9%) 2.5% (8.2%)
4: Rigid 3D BFGS − (−) − (−) −0.8% (−15.4%)
4: Rigid 3D N − (−) − (−) 0.0% (0.0%)
Total: GD −0.5% (−13.2%) −2.8% (−28.3%) −2.6% (−8.0%)
Total: BFGS −0.1% (−4.0%) −2.9% (−34.5%) −1.6% (−14.4%)
Total: N −0.7% (−17.5%) −1.4% (−13.2%) −0.4% (−2.8%)
Total: ALL −0.4% (−11.8%) −2.3% (−24.0%) −1.3% (−7.6%)
Table 2 Overall speedup found when using the generalized IC
method. These show the percentage speedup ( TimeSTD−TimeICTimeSTD × 100%)
when compared to the classical (forwards additive) method. The totals
are averages of these percentages
Measure MSD NCC MI
1a: Syn. Hom. GD 29.6% 27.1% 14.2%
1a: Syn. Hom. BFGS 24% 21.5% 0.8%
1a: Syn. Hom. N 29.3% 30% 45.3%
1b: Syn. Hom. GD 29.7% 31.4% 31.5%
1b: Syn. Hom. BFGS 23.6% 23.6% −2.1%
1b: Syn. Hom. N 30.6% 37.1% 53.3%
2: Syn. R. 3D BFGS 52% – 33.3%
2: Syn. R. 3D N 17.5% – 21.2%
3: Homog. GD – 16.1% 10.7%
3: Homog. BFGS – 17.4% 1.5%
3: Homog. N – 19.2% 39.5%
4: Rigid 3D BFGS – – 19.5%
4: Rigid 3D N – – 63.6%
Total: GD 29.6% 24.9% 18.8%
Total: BFGS 33.2% 20.8% 8.8%
Total: N 25.8% 28.8% 44.6%
Total: ALL 29.5% 24.8% 24.9%
BFGS optimizer had some difficulty optimizing the IC MI
due, again, to the asymmetry. An extension of the symmetric
MILK mutual information implementation of (Dowson and
Bowden 2008) would likely provide some further efficiency
gains.
From the results, it seems clear that the advantages of the
generalized ESM method in this context, are not its speed
but its greater reliability. Table 3 shows the change in failure
rate for each experiment, and the overall total. The results
for experiment 1b, which was designed to be particularly
difficult, show significantly more reliable results when us-
ing ESM. The improvement is somewhat hidden in the over-
all results (bottom rows, Table 3) because most experiments
were feasible for all the algorithms under test, and thus the
greater reliability of generalized ESM never came into play.
6 Conclusions
In this paper, we have presented generalizations of the pop-
ular inverse compositional (IC) and efficient second order
methods (ESM) for image registration. Our generalization
is based on the insight that these methods can be viewed
as different ways of computing the derivative and Hessian
of the image difference measure. The inverse compositional
method provided us with an efficient way to compute ∇φf D
and Hφf D when what standard optimizers need is ∇φmD
and HφmD. ESM, on the other hand, computes more accu-
rate estimates of ∇φD and HφD by using both those with
Table 3 Change in failure rates
found when using the
generalized ESM method. These
show the change in failure rate
percentage (FailESM − FailSTD)
when compared to the classical
(forwards additive) method. The
percentage change of the failure
rate ( FailESM−FailSTDFailSTD × 100%) is
shown in parentheses. The totals
are computed by first averaging
the failure rates giving each
experiment equal weight, and
then computing the change
Measure MSD NCC MI
1a: Syn. Hom. GD 0.1% (11.1%) −0.2% (−11.1%) −0.3% (−3.9%)
1a: Syn. Hom. BFGS 0.0% (0.0%) 0.0% (0.0%) −0.2% (−100.0%)
1a: Syn. Hom. N −0.2% (−60.0%) −0.1% (−20.0%) 0.5% (200.0%)
1b: Syn. Hom. GD −1.0% (−14.3%) −5.4% (−36.0%) −5.6% (−8.0%)
1b: Syn. Hom. BFGS −1.6% (−25.8%) −4.2% (−42.4%) −12.1% (−45.3%)
1b: Syn. Hom. N −1.8% (−15.4%) −2.6% (−15.3%) −5.0% (−14.1%)
2: Syn. R. 3D BFGS 1.3% (100.0%) − (−) 0.0% (0.0%)
2: Syn. R. 3D N 0.0% (0.0%) − (−) 0.0% (0.0%)
3: Homog. GD − (−) −2.8% (−21.4%) −1.9% (−9.7%)
3: Homog. BFGS − (−) −4.4% (−28.6%) 3.1% (9.1%)
3: Homog. N − (−) −1.6% (−10.9%) 2.5% (8.2%)
4: Rigid 3D BFGS − (−) − (−) −0.8% (−15.4%)
4: Rigid 3D N − (−) − (−) 0.0% (0.0%)
Total: GD −0.5% (−13.2%) −2.8% (−28.3%) −2.6% (−8.0%)
Total: BFGS −0.1% (−4.0%) −2.9% (−34.5%) −1.6% (−14.4%)
Total: N −0.7% (−17.5%) −1.4% (−13.2%) −0.4% (−2.8%)
Total: ALL −0.4% (−11.8%) −2.3% (−24.0%) −1.3% (−7.6%)
Page 17
Int J Comput Vis (2010) 87: 191–212 207
respect to φf and those with respect to φm. Both methods
previously placed significant restrictions on how the rest of
the image registration problem had to be formulated. Our
generalization shows that the chain rule can be used to con-
vert the derivatives from one variable to another. The advan-
tages of these methods can then be obtained in more general
image registration contexts.
We have implemented and experimented with our ap-
proach using three popular image similarity measures, the
Mean Squared Difference (MSD), Normalized Cross-Corr-
elation (NCC), and Mutual Information (MI), and three
typical optimizers, gradient descent (GD), quasi-Newton
(BFGS) and Newton-Raphson (N) methods. The IC method
can achieve a maximal speed up of about 1/3 and our gen-
eralized method achieves nearly that for the MSD and NCC
image difference measures on these optimizers. The speedup
is less, but still significant, for the MI measure. This measure
was more difficult mainly due to our asymmetric formula-
tion.
We found that the generalized ESM method does not im-
prove results in terms of speed, but does improve the relia-
bility of the optimization. This can be understood intuitively
as the ESM method combines estimates of the optimization
step from both the inverse and forward approaches. This es-
timate is, particularly in difficult cases, a better estimate on
average than either estimate alone.
A further advantage of our generalization, is that the vari-
ous approaches can be easily combined, which is something
we wish to address in future work. It is probable that using
a few of the relatively expensive ESM derivative calcula-
tions early in the optimization process would improve the
optimizers’ capture radius, while using IC derivative calcu-
lations later in the optimization process would improve the
computational efficiency. In this way our generalization can
allow us to get the best of both worlds.
Acknowledgements The authors would like to thank CLUMEQ
(Consortium Laval, Université du Québec, McGill and Eastern Que-
bec) and the RQCHP (Réseau québécois de calcul de haute perfor-
mance) for providing computational resources for this project.
Appendix A: Derivative of φf with Respect to φm for
Common Transforms
In this section, we show how the ∂φf∂φm can be computed for
the transforms used in the paper. This involves a number of
multidimensional objects which can be considered as ma-
trices, and as vectorizations of these matrices. To be clear
in matters of dimensionality we will use Harshman’s ma-
trix notation (Harshman 2001). In this notation, all array di-
mensions are represented as subscripts following the matrix
name. Adjacent matrices with the same subscript indicates
a contraction (multiplication and summation) along the cor-
responding dimension. (This is also known as the Einstein
summation convention.) For example,
AIJ BI ′J =
J
∑
j=1
AijBi′j = A · B
T .
Note that I and I ′ are not the same index and thus no sum-
mation is performed on them. Indexes that are grouped with
parentheses represent the vectorization of those two dimen-
sions into one longer dimension. For example,
if AIJ =
[
1 2
3 4
]
then A(IJ ) =
⎡
⎢
⎢
⎣
1
2
3
4
⎤
⎥
⎥
⎦
.
To avoid confusion between subscripts for summation, and
modifiers to the names of variables, a modified variable
name such as φm will be written as φmP in this notation. For
brevity, when no summation or multiplication is involved,
such as when a vector is the argument of a function, the sub-
script will be omitted (e.g. MIJ (φ)). Finally, δIJK repre-
sents the identity, that is
δijk =
{
1 when i = j = k,
0 otherwise.
Several transforms in common use in image registration
problems have a matrix transform of homogeneous coordi-
nates at their core. That is, the transformation is carried out
using the following equation:
[
λxˆD
λ
]
I
= TIJ MJK(φ)T
−1
KL ·
[
cxD
1
]
L
, (15)
where xD are the original coordinates, xˆD are the trans-
formed coordinates, TIJ is a translation that manages the
effective center of the transform, and MJK(φ) is a func-
tion returning the homogeneous transformation matrix cor-
responding to the parameters, φ. The transforms are com-
posed through matrix multiplication, and inverted by matrix
inversion. We are interested in the parameters of the fixed
transform φf as a function of the parameters of the moving
transform, φm. Using the matrix operations, we get
φf
P
(φm) = φm−1 ◦ ˆφm
= φP (MIJ (φm)MJK( ˆφm))
where MIJ (φ) is the matrix that corresponds to a set of pa-
rameters, φ; ˆφm is the starting transform, and φP (M) is
the function that returns the transformation parameters for
respect to φf and those with respect to φm. Both methods
previously placed significant restrictions on how the rest of
the image registration problem had to be formulated. Our
generalization shows that the chain rule can be used to con-
vert the derivatives from one variable to another. The advan-
tages of these methods can then be obtained in more general
image registration contexts.
We have implemented and experimented with our ap-
proach using three popular image similarity measures, the
Mean Squared Difference (MSD), Normalized Cross-Corr-
elation (NCC), and Mutual Information (MI), and three
typical optimizers, gradient descent (GD), quasi-Newton
(BFGS) and Newton-Raphson (N) methods. The IC method
can achieve a maximal speed up of about 1/3 and our gen-
eralized method achieves nearly that for the MSD and NCC
image difference measures on these optimizers. The speedup
is less, but still significant, for the MI measure. This measure
was more difficult mainly due to our asymmetric formula-
tion.
We found that the generalized ESM method does not im-
prove results in terms of speed, but does improve the relia-
bility of the optimization. This can be understood intuitively
as the ESM method combines estimates of the optimization
step from both the inverse and forward approaches. This es-
timate is, particularly in difficult cases, a better estimate on
average than either estimate alone.
A further advantage of our generalization, is that the vari-
ous approaches can be easily combined, which is something
we wish to address in future work. It is probable that using
a few of the relatively expensive ESM derivative calcula-
tions early in the optimization process would improve the
optimizers’ capture radius, while using IC derivative calcu-
lations later in the optimization process would improve the
computational efficiency. In this way our generalization can
allow us to get the best of both worlds.
Acknowledgements The authors would like to thank CLUMEQ
(Consortium Laval, Université du Québec, McGill and Eastern Que-
bec) and the RQCHP (Réseau québécois de calcul de haute perfor-
mance) for providing computational resources for this project.
Appendix A: Derivative of φf with Respect to φm for
Common Transforms
In this section, we show how the ∂φf∂φm can be computed for
the transforms used in the paper. This involves a number of
multidimensional objects which can be considered as ma-
trices, and as vectorizations of these matrices. To be clear
in matters of dimensionality we will use Harshman’s ma-
trix notation (Harshman 2001). In this notation, all array di-
mensions are represented as subscripts following the matrix
name. Adjacent matrices with the same subscript indicates
a contraction (multiplication and summation) along the cor-
responding dimension. (This is also known as the Einstein
summation convention.) For example,
AIJ BI ′J =
J
∑
j=1
AijBi′j = A · B
T .
Note that I and I ′ are not the same index and thus no sum-
mation is performed on them. Indexes that are grouped with
parentheses represent the vectorization of those two dimen-
sions into one longer dimension. For example,
if AIJ =
[
1 2
3 4
]
then A(IJ ) =
⎡
⎢
⎢
⎣
1
2
3
4
⎤
⎥
⎥
⎦
.
To avoid confusion between subscripts for summation, and
modifiers to the names of variables, a modified variable
name such as φm will be written as φmP in this notation. For
brevity, when no summation or multiplication is involved,
such as when a vector is the argument of a function, the sub-
script will be omitted (e.g. MIJ (φ)). Finally, δIJK repre-
sents the identity, that is
δijk =
{
1 when i = j = k,
0 otherwise.
Several transforms in common use in image registration
problems have a matrix transform of homogeneous coordi-
nates at their core. That is, the transformation is carried out
using the following equation:
[
λxˆD
λ
]
I
= TIJ MJK(φ)T
−1
KL ·
[
cxD
1
]
L
, (15)
where xD are the original coordinates, xˆD are the trans-
formed coordinates, TIJ is a translation that manages the
effective center of the transform, and MJK(φ) is a func-
tion returning the homogeneous transformation matrix cor-
responding to the parameters, φ. The transforms are com-
posed through matrix multiplication, and inverted by matrix
inversion. We are interested in the parameters of the fixed
transform φf as a function of the parameters of the moving
transform, φm. Using the matrix operations, we get
φf
P
(φm) = φm−1 ◦ ˆφm
= φP (MIJ (φm)MJK( ˆφm))
where MIJ (φ) is the matrix that corresponds to a set of pa-
rameters, φ; ˆφm is the starting transform, and φP (M) is
the function that returns the transformation parameters for
Page 18
208 Int J Comput Vis (2010) 87: 191–212
a given matrix, M . So long as the transform center does not
change, it may be ignored for this discussion, since
[
TIJ MJKT
−1
KL
]
−1[
TLMNMNT
−1
NP
]
=
[
TIJ M
−1
JKNKNT
−1
NP
]
.
Then the derivative of interest is
∂φf
P
∂φm
Q
=
∂φf
P
∂M(IJ )(φm)
∂M(IJ )(φm)
∂φm
Q
.
Noting that the derivative of the inverse of a matrix function
of a scalar parameter a is:
∂M(IL)(a)−1
∂a
= MIJ (a)
−1 ∂MJK(a)
∂a
MKL(a)
−1
and that
MIK(φf ) = MIJ (φm)
−1MJK( ˆφm),
we obtain
∂M(IM)(φf )
∂φm
Q
= MIJ (φm)
−1
·
∂MJK(φm)
∂φm
Q
· MKL(φm)
−1
· MLM( ˆφm). (16)
Note the indexes carefully—the final result is a matrix of
size (IJ ) × Q formed by stacking up the vectorized results
of Q multiplications of 2D matrices of size I × J . There-
fore to compute ∂φf P∂φm
Q
for a matrix based transform we must
simply know ∂φP∂M(IJ ) and
∂M(IJ )
∂φP
for its particular parameter-
ization.
A.1 Homogeneous Transform
The parameterization of the homography used here is
one commonly used in computer vision (e.g. Baker and
Matthews 2004), shown here for the 2D case. (The 3D
case—not used in this paper—can easily be developed by
adding columns and rows corresponding to the additional
dimension.)
M(φ) =
⎡
⎣
φ1 φ2 φ5
φ3 φ4 φ6
φ7 φ8 1
⎤
⎦ .
There is a linear relationship between MIJ and φP which
we can represent as a matrix equation
MIJ (φ)
= A(IJ )P φP
=
⎡
⎢
⎢
⎢
⎢
⎢
⎢
⎣
δDD′ 0DD′ 0D 0D 0D 0DD′
0D′ 0D′ 0 1 0 0D′
0DD′ δDD′ 0D 0D 0D 0DD′
0D′ 0D′ 0 0 1 0D′
0DD′ 0DD′ 0D 0D 0D δDD′
0D′ 0D′ 0 0 0 0D′
⎤
⎥
⎥
⎥
⎥
⎥
⎥
⎦
(IJ )P
· φP .
(17)
Here we will consistently use the subscripts D, (D′) to indi-
cate there are a number of rows (columns) equal to the space
dimension.
To do the reverse and obtain φ from a matrix, the matrix
must first be scaled so that its lower right element is one.
Then the inverse of Eq. 17 is clearly
φP (M) =
1
M3,3
A+P(IJ )M(IJ )
where A+P(IJ ) is the pseudoinverse of A(IJ )P , so
φP (M) =
⎡
⎢
⎢
⎢
⎢
⎣
δDD′ 0D 0DD′ 0D 0DD′ 0D
0DD′ 0D δDD′ 0D 0DD′ 0D
0D′ 1 0D′ 0 0D′ 0
0D′ 0 0D′ 1 0D′ 0
0DD′ 0D 0DD′ 0D δDD′ 0D
⎤
⎥
⎥
⎥
⎥
⎦
(IJ )P
×
M(IJ )
M3,3
.
Then,
∂φP
∂M(IJ )
=
1
M3,3
A(IJ )P +
∂M3,3
∂M(IJ )
A+P(IJ )M(IJ )
and
∂φP
∂M(IJ )
=
1
M3,3
·
⎡
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎣
δDD′ 0D 0DD′ 0D 0DD′ −1M3,3 M 1,2D
0DD′ 0D δDD′ 0D 0DD′ −1M3,3 M 4,5D
0D′ 1 0D′ 0 0D′ −1M3,3 M3
0D′ 0 0D′ 1 0D′ −1M3,3 M4
0DD′ 0D 0DD′ 0D δDD′ −1M3,3 M 7,8D
⎤
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎦
(IJ )P
.
The necessary derivative, ∂φf P∂φm
Q
can now be calculated
using Eq. 16.
A.2 Rigid 3D Transform
A rigid 3D transform can also be represented as a matrix.
However, the rigid transforms form a subgroup of the ma-
trix transforms, and consequently cannot be parameterized
a given matrix, M . So long as the transform center does not
change, it may be ignored for this discussion, since
[
TIJ MJKT
−1
KL
]
−1[
TLMNMNT
−1
NP
]
=
[
TIJ M
−1
JKNKNT
−1
NP
]
.
Then the derivative of interest is
∂φf
P
∂φm
Q
=
∂φf
P
∂M(IJ )(φm)
∂M(IJ )(φm)
∂φm
Q
.
Noting that the derivative of the inverse of a matrix function
of a scalar parameter a is:
∂M(IL)(a)−1
∂a
= MIJ (a)
−1 ∂MJK(a)
∂a
MKL(a)
−1
and that
MIK(φf ) = MIJ (φm)
−1MJK( ˆφm),
we obtain
∂M(IM)(φf )
∂φm
Q
= MIJ (φm)
−1
·
∂MJK(φm)
∂φm
Q
· MKL(φm)
−1
· MLM( ˆφm). (16)
Note the indexes carefully—the final result is a matrix of
size (IJ ) × Q formed by stacking up the vectorized results
of Q multiplications of 2D matrices of size I × J . There-
fore to compute ∂φf P∂φm
Q
for a matrix based transform we must
simply know ∂φP∂M(IJ ) and
∂M(IJ )
∂φP
for its particular parameter-
ization.
A.1 Homogeneous Transform
The parameterization of the homography used here is
one commonly used in computer vision (e.g. Baker and
Matthews 2004), shown here for the 2D case. (The 3D
case—not used in this paper—can easily be developed by
adding columns and rows corresponding to the additional
dimension.)
M(φ) =
⎡
⎣
φ1 φ2 φ5
φ3 φ4 φ6
φ7 φ8 1
⎤
⎦ .
There is a linear relationship between MIJ and φP which
we can represent as a matrix equation
MIJ (φ)
= A(IJ )P φP
=
⎡
⎢
⎢
⎢
⎢
⎢
⎢
⎣
δDD′ 0DD′ 0D 0D 0D 0DD′
0D′ 0D′ 0 1 0 0D′
0DD′ δDD′ 0D 0D 0D 0DD′
0D′ 0D′ 0 0 1 0D′
0DD′ 0DD′ 0D 0D 0D δDD′
0D′ 0D′ 0 0 0 0D′
⎤
⎥
⎥
⎥
⎥
⎥
⎥
⎦
(IJ )P
· φP .
(17)
Here we will consistently use the subscripts D, (D′) to indi-
cate there are a number of rows (columns) equal to the space
dimension.
To do the reverse and obtain φ from a matrix, the matrix
must first be scaled so that its lower right element is one.
Then the inverse of Eq. 17 is clearly
φP (M) =
1
M3,3
A+P(IJ )M(IJ )
where A+P(IJ ) is the pseudoinverse of A(IJ )P , so
φP (M) =
⎡
⎢
⎢
⎢
⎢
⎣
δDD′ 0D 0DD′ 0D 0DD′ 0D
0DD′ 0D δDD′ 0D 0DD′ 0D
0D′ 1 0D′ 0 0D′ 0
0D′ 0 0D′ 1 0D′ 0
0DD′ 0D 0DD′ 0D δDD′ 0D
⎤
⎥
⎥
⎥
⎥
⎦
(IJ )P
×
M(IJ )
M3,3
.
Then,
∂φP
∂M(IJ )
=
1
M3,3
A(IJ )P +
∂M3,3
∂M(IJ )
A+P(IJ )M(IJ )
and
∂φP
∂M(IJ )
=
1
M3,3
·
⎡
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎣
δDD′ 0D 0DD′ 0D 0DD′ −1M3,3 M 1,2D
0DD′ 0D δDD′ 0D 0DD′ −1M3,3 M 4,5D
0D′ 1 0D′ 0 0D′ −1M3,3 M3
0D′ 0 0D′ 1 0D′ −1M3,3 M4
0DD′ 0D 0DD′ 0D δDD′ −1M3,3 M 7,8D
⎤
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎦
(IJ )P
.
The necessary derivative, ∂φf P∂φm
Q
can now be calculated
using Eq. 16.
A.2 Rigid 3D Transform
A rigid 3D transform can also be represented as a matrix.
However, the rigid transforms form a subgroup of the ma-
trix transforms, and consequently cannot be parameterized
Page 19
Int J Comput Vis (2010) 87: 191–212 209
in terms of matrix elements. This renders the derivative quite
non-linear in the parameters.
Here we develop the 3D case, for an Euler angle parame-
terization of 3D rigid transforms, which is used in the paper.
Euler angle parameterizations present flaws when large ro-
tations (near 90°) are present, but that is not the case for the
problems of interest in this paper. The matrix for a given set
of parameters is given by
AIJ (φ) =
[
RZ (φ3)RX (φ1)RY (φ2) φ4,5,6
0D′ 1
]
IJ
where Rα(φ) is a rotation matrix around the α axis by an
angle φ.
For brevity in what follows, we define s1 = sin(φ1), c1 =
cos(φ1) and so forth. The matrix AIJ (φ) can be expanded
as
AIJ (φ)
=
⎡
⎢
⎢
⎣
c3c2 − s3s1s2 −s3c1 c3s2 + s3s1c2 φ4
s3c2 + c3s1s2 c3c1 s3s2 − c3s1c2 φ5
−c1s2 s1 c1c2 φ6
0 0 0 1
⎤
⎥
⎥
⎦
IJ
.
If c1 is small then
φP (A) =
⎡
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎣
sin−1(A3,2)
tan−1( A3,3
cos(sin−1(A3,2))
, −A3,1
cos(sin−1(A3,2))
)
tan−1( A2,2
cos(sin−1(A3,2))
, −A1,2
cos(sin−1(A3,2))
)
A1,4
A2,4
A3,4
⎤
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎦
; (18)
otherwise,
φP (A) =
⎡
⎢
⎢
⎢
⎢
⎢
⎢
⎣
sin−1(A3,2)
tan−1(A1,1,A2,1)
0
A1,4
A2,4
A3,4
⎤
⎥
⎥
⎥
⎥
⎥
⎥
⎦
. (19)
(We have used the criterion that |A3,2| < 0.99999999875 to
choose between Eqs. (18) and (19).)
Here tan−1(c, s) is the inverse tangent function defined
for the full circle, where c and s are the cosine and sine of
the angle in question:
tan−1(c, s) =
⎧
⎪
⎪
⎪
⎪
⎨
⎪
⎪
⎪
⎪
⎩
sin−1(s) ∀ |c| > |s|, c > 0,
π − sin−1(s) ∀ |c| > |s|, c < 0,
cos−1(c) ∀ |c| < |s|, s > 0,
− cos−1(c) ∀ |c| < |s|, s < 0.
The derivative of tan−1(c, s), where c and s are dependent
functions of some other variable, x, is
∂ tan−1(c(x), s(x))
∂x
=
⎧
⎨
⎩
1
c(x)
∂c(x)
∂x ∀ |c(x)| > |s(x)|,
−1
s(x)
∂s(x)
∂x ∀ |c(x)| < |s(x)|.
We can now compute the derivatives of A(φ) and φ(A)
∂A(IJ )(φ)
∂φP
=
⎡
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎣
−s3s2c1 −c3s2 − s3s1c2 −s3c2 − c3s1s2 0 0 0
s3s1 0 −c3c1 0 0 0
s3c1c2 c3c2 − s3s1s2 −s3s2 + c3s1c2 0 0 0
0 0 0 1 0 0
c3s2c1 −s3s2 + c3s1c2 c3c2 − s3s1s2 0 0 0
−c3s1 0 −s3c1 0 0 0
−c3c1c2 s3c2 + c3s1s2 c3s2 + s3s1c2 0 0 0
0 0 0 0 1 0
s1s2 −c1c2 0 0 0 0
c1 0 0 0 0 0
−c2s1 −c1s2 0 0 0 0
0 0 0 0 0 1
⎤
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎦
(IJ )P
.
(20)
Let λ and ¯λ be indicator functions where λi = 1 iff |ci | ≥
|si | and ¯λi = (1 − λi).
∂φP (A)
∂A(IJ )
T
=
⎡
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎣
0 0 0 0 0 0
0 0 −λ3c1c3 0 0 0
0 0 0 0 0 0
0 0 0 1 0 0
0 0 0 0 0 0
0 0 −¯λ3c1s3 0 0 0
0 0 0 0 0 0
0 0 0 0 1 0
0 −λ2c1c2 0 0 0 0
1
c1
p q 0 0 0
0 −¯λ2c1s2 0 0 0 0
0 0 0 0 0 1
⎤
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎦
P(IJ )
(21)
where
p =
λ2A3,1s1
c2|c
3
1|
+
¯λ2A3,3s1
s2|c
3
1|
and
q =
λ3A1,2s1
c3|c
3
1|
+
¯λ3A2,2s1
s3|c
3
1|
Equation 18 is valid as long as the angles involved do not
approach 90 degrees. However, this is not a concern in this
work since we are only ever evaluating Eq. 21 at φf = 0.
At this point, all the cosine terms become one, and the sine
in terms of matrix elements. This renders the derivative quite
non-linear in the parameters.
Here we develop the 3D case, for an Euler angle parame-
terization of 3D rigid transforms, which is used in the paper.
Euler angle parameterizations present flaws when large ro-
tations (near 90°) are present, but that is not the case for the
problems of interest in this paper. The matrix for a given set
of parameters is given by
AIJ (φ) =
[
RZ (φ3)RX (φ1)RY (φ2) φ4,5,6
0D′ 1
]
IJ
where Rα(φ) is a rotation matrix around the α axis by an
angle φ.
For brevity in what follows, we define s1 = sin(φ1), c1 =
cos(φ1) and so forth. The matrix AIJ (φ) can be expanded
as
AIJ (φ)
=
⎡
⎢
⎢
⎣
c3c2 − s3s1s2 −s3c1 c3s2 + s3s1c2 φ4
s3c2 + c3s1s2 c3c1 s3s2 − c3s1c2 φ5
−c1s2 s1 c1c2 φ6
0 0 0 1
⎤
⎥
⎥
⎦
IJ
.
If c1 is small then
φP (A) =
⎡
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎣
sin−1(A3,2)
tan−1( A3,3
cos(sin−1(A3,2))
, −A3,1
cos(sin−1(A3,2))
)
tan−1( A2,2
cos(sin−1(A3,2))
, −A1,2
cos(sin−1(A3,2))
)
A1,4
A2,4
A3,4
⎤
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎦
; (18)
otherwise,
φP (A) =
⎡
⎢
⎢
⎢
⎢
⎢
⎢
⎣
sin−1(A3,2)
tan−1(A1,1,A2,1)
0
A1,4
A2,4
A3,4
⎤
⎥
⎥
⎥
⎥
⎥
⎥
⎦
. (19)
(We have used the criterion that |A3,2| < 0.99999999875 to
choose between Eqs. (18) and (19).)
Here tan−1(c, s) is the inverse tangent function defined
for the full circle, where c and s are the cosine and sine of
the angle in question:
tan−1(c, s) =
⎧
⎪
⎪
⎪
⎪
⎨
⎪
⎪
⎪
⎪
⎩
sin−1(s) ∀ |c| > |s|, c > 0,
π − sin−1(s) ∀ |c| > |s|, c < 0,
cos−1(c) ∀ |c| < |s|, s > 0,
− cos−1(c) ∀ |c| < |s|, s < 0.
The derivative of tan−1(c, s), where c and s are dependent
functions of some other variable, x, is
∂ tan−1(c(x), s(x))
∂x
=
⎧
⎨
⎩
1
c(x)
∂c(x)
∂x ∀ |c(x)| > |s(x)|,
−1
s(x)
∂s(x)
∂x ∀ |c(x)| < |s(x)|.
We can now compute the derivatives of A(φ) and φ(A)
∂A(IJ )(φ)
∂φP
=
⎡
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎣
−s3s2c1 −c3s2 − s3s1c2 −s3c2 − c3s1s2 0 0 0
s3s1 0 −c3c1 0 0 0
s3c1c2 c3c2 − s3s1s2 −s3s2 + c3s1c2 0 0 0
0 0 0 1 0 0
c3s2c1 −s3s2 + c3s1c2 c3c2 − s3s1s2 0 0 0
−c3s1 0 −s3c1 0 0 0
−c3c1c2 s3c2 + c3s1s2 c3s2 + s3s1c2 0 0 0
0 0 0 0 1 0
s1s2 −c1c2 0 0 0 0
c1 0 0 0 0 0
−c2s1 −c1s2 0 0 0 0
0 0 0 0 0 1
⎤
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎦
(IJ )P
.
(20)
Let λ and ¯λ be indicator functions where λi = 1 iff |ci | ≥
|si | and ¯λi = (1 − λi).
∂φP (A)
∂A(IJ )
T
=
⎡
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎣
0 0 0 0 0 0
0 0 −λ3c1c3 0 0 0
0 0 0 0 0 0
0 0 0 1 0 0
0 0 0 0 0 0
0 0 −¯λ3c1s3 0 0 0
0 0 0 0 0 0
0 0 0 0 1 0
0 −λ2c1c2 0 0 0 0
1
c1
p q 0 0 0
0 −¯λ2c1s2 0 0 0 0
0 0 0 0 0 1
⎤
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎦
P(IJ )
(21)
where
p =
λ2A3,1s1
c2|c
3
1|
+
¯λ2A3,3s1
s2|c
3
1|
and
q =
λ3A1,2s1
c3|c
3
1|
+
¯λ3A2,2s1
s3|c
3
1|
Equation 18 is valid as long as the angles involved do not
approach 90 degrees. However, this is not a concern in this
work since we are only ever evaluating Eq. 21 at φf = 0.
At this point, all the cosine terms become one, and the sine
Page 20
210 Int J Comput Vis (2010) 87: 191–212
terms become zero, allowing us to simplify Eq. 21 to
∂φP (A)
∂A(IJ )
T
=
⎡
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎣
0 0 0 0 0 0
0 0 −1 0 0 0
0 0 0 0 0 0
0 0 0 1 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 1 0
0 −1 0 0 0 0
1 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 1
⎤
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎦
P(IJ )
.
As in the previous case, this allows us to calculate ∂φf P∂φm
Q
using Eq. 16.
Appendix B: Derivatives of DMSD , DNCC and DMI
In this appendix, we explicitly calculate the gradient and ap-
proximate Hessian for our image difference measures.
B.1 Mean Squared Difference
The measure is given by:
DMSD =
1
N
N
∑
i=1
[
If (xi) − Im(W(xi,φm))
]2
. (22)
For brevity, we will replace If (xi) with Ifi and Im(W(xi,
φm)) with Imi . The gradient and Gauss-Newton approxima-
tion to the Hessian of DMSD are straightforward. The gradi-
ent is given by:
∇φmDMSD =
2
N
N
∑
i=1
[[Ifi − Imi ] · ∇φmImi ] (23)
and the true Hessian is
HφmDMSD =
2
N
N
∑
i=1
([
∇φmImi · ∇φmI
T
mi
]
+ [Ifi − Imi ] · HφmImi
)
. (24)
Note that gradients are considered to be column vectors, so
the first term is an outer product. The Gauss-Newton approx-
imation is to drop the second term, giving:
HφmDMSD ≈
2
N
N
∑
i=1
[∇φmImi · ∇φmImi ]. (25)
u =
∑
i
[
(Ifi − ¯If )(Imi − ¯Im)
]
,
v =
[
∑
i
[
(Ifi − ¯If )
2]
∑
j
[
(Imj − ¯Im)
2]
]
1
2
,
(26)
∇φmDNCC =
−1
v
·
∑
i
[
(Ifi − ¯If ) · ∇φmImi
]
+
u
v3
·
∑
k
[
(Ifk − ¯If )
2]
·
∑
j
[
(Imj − ¯Im) · ∇φmImj
]
, (27)
HφmDNCC ≈
2 ·
∑
i[(Ifi − ¯If )
2
]
v3
·
∑
i
[
(Ifi − ¯If ) · ∇φmImi
]
·
∑
j
[
(Imj − ¯Im) · ∇φmImj
]T
−
3u · [
∑
i[(Ifi − ¯If )
2
]]
2
v5
·
∑
i
[
(Imi − ¯Im) · ∇φmImi
]
·
∑
j
[
(Imj − ¯Im) · ∇φmImj
]T
+
u ·
∑
i[(Ifi − ¯If )
2
]
v3
·
∑
i
[
∇φmImi · ∇φmI
T
mj
]
. (28)
B.2 Normalized Correlation
Note that to remain consistent with our discussion of mini-
mization, we define our cost function, DNCC , as the negative
normalized correlation. The DNCC measure is a ratio of two
terms
DNCC =
−u
v
(29)
where
u =
∑
i
[
(Ifi − ¯If )(Imi − ¯Im)
]
,
and
v =
[
∑
i
[
(Ifi − ¯If )
2]
∑
j
[
(Imj − ¯Im)
2]
]
1
2
,
terms become zero, allowing us to simplify Eq. 21 to
∂φP (A)
∂A(IJ )
T
=
⎡
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎢
⎣
0 0 0 0 0 0
0 0 −1 0 0 0
0 0 0 0 0 0
0 0 0 1 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 1 0
0 −1 0 0 0 0
1 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 1
⎤
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎥
⎦
P(IJ )
.
As in the previous case, this allows us to calculate ∂φf P∂φm
Q
using Eq. 16.
Appendix B: Derivatives of DMSD , DNCC and DMI
In this appendix, we explicitly calculate the gradient and ap-
proximate Hessian for our image difference measures.
B.1 Mean Squared Difference
The measure is given by:
DMSD =
1
N
N
∑
i=1
[
If (xi) − Im(W(xi,φm))
]2
. (22)
For brevity, we will replace If (xi) with Ifi and Im(W(xi,
φm)) with Imi . The gradient and Gauss-Newton approxima-
tion to the Hessian of DMSD are straightforward. The gradi-
ent is given by:
∇φmDMSD =
2
N
N
∑
i=1
[[Ifi − Imi ] · ∇φmImi ] (23)
and the true Hessian is
HφmDMSD =
2
N
N
∑
i=1
([
∇φmImi · ∇φmI
T
mi
]
+ [Ifi − Imi ] · HφmImi
)
. (24)
Note that gradients are considered to be column vectors, so
the first term is an outer product. The Gauss-Newton approx-
imation is to drop the second term, giving:
HφmDMSD ≈
2
N
N
∑
i=1
[∇φmImi · ∇φmImi ]. (25)
u =
∑
i
[
(Ifi − ¯If )(Imi − ¯Im)
]
,
v =
[
∑
i
[
(Ifi − ¯If )
2]
∑
j
[
(Imj − ¯Im)
2]
]
1
2
,
(26)
∇φmDNCC =
−1
v
·
∑
i
[
(Ifi − ¯If ) · ∇φmImi
]
+
u
v3
·
∑
k
[
(Ifk − ¯If )
2]
·
∑
j
[
(Imj − ¯Im) · ∇φmImj
]
, (27)
HφmDNCC ≈
2 ·
∑
i[(Ifi − ¯If )
2
]
v3
·
∑
i
[
(Ifi − ¯If ) · ∇φmImi
]
·
∑
j
[
(Imj − ¯Im) · ∇φmImj
]T
−
3u · [
∑
i[(Ifi − ¯If )
2
]]
2
v5
·
∑
i
[
(Imi − ¯Im) · ∇φmImi
]
·
∑
j
[
(Imj − ¯Im) · ∇φmImj
]T
+
u ·
∑
i[(Ifi − ¯If )
2
]
v3
·
∑
i
[
∇φmImi · ∇φmI
T
mj
]
. (28)
B.2 Normalized Correlation
Note that to remain consistent with our discussion of mini-
mization, we define our cost function, DNCC , as the negative
normalized correlation. The DNCC measure is a ratio of two
terms
DNCC =
−u
v
(29)
where
u =
∑
i
[
(Ifi − ¯If )(Imi − ¯Im)
]
,
and
v =
[
∑
i
[
(Ifi − ¯If )
2]
∑
j
[
(Imj − ¯Im)
2]
]
1
2
,
Page 21
Int J Comput Vis (2010) 87: 191–212 211
where ¯If and ¯Im are the mean intensities for the fixed and
moving images, respectively. The gradient is then
∂DNCC
∂φ
=
1
v2
(
u
∂v
∂Im
− v
∂u
∂Im
)
∂Im
∂φ
.
Temporarily defining A = u ∂v∂Im , B = v
∂u
∂Im
, and C = v2 we
obtain a Hessian of
∂2DNCC
∂φ2
=
∂Im
∂φ
T (C ∂A∂Im + B
∂C
∂Im
− A ∂C∂Im − C
∂B
∂Im
C2
)
∂Im
∂φ
+
1
v2
(
u
∂v
∂Im
− v
∂u
∂Im
)
∂2I
∂φ2
.
The second term is equivalent to what is dropped in the
Gauss-Newton approximation for Mean Squares (Eq. 25).
We will also assume this term is small and drop it. We now
turn our attention to each of the parts of the first term.
Assuming that the mean values of the image, ¯Im, do not
appreciably change as the parameters change, the derivatives
of u and v with respect to a particular pixel in the moving
image, Imi , are:
∂u
∂Imi
= (Ifi − ¯If )
and
∂v
∂Imi
=
1
v
∑
k
[
(Ifk − ¯If )
2](Imi − ¯Im).
Letting F =
∑
k[(Ifk − ¯If )
2
], and noting that it is a constant,
A =
u
v
· F · (Imi − ¯Im),
and,
∂A
∂Imj
=
F
v
(Ifj − ¯If )(Imi − ¯Im)
−
uF 2
v3
(Imj − ¯Im)(Imi − ¯Im) +
uF
v
δij ,
where δij is an indicator function, equal to 1 if i = j and 0
otherwise. Continuing,
B = v · (Ifi − ¯If ),
∂B
∂Imj
=
F
v
· (Ifi − ¯If ) · (Imj − ¯Im),
and,
∂C
∂Imj
= 2F · (Imj − ¯Im).
Combining the terms and simplifying, we obtain Eqs. 27
and 28 for the gradient and Hessian of DNCC , respectively.
This Hessian is lengthy to write, but tractable to com-
pute. The first two terms of Eq. 28 can be computed from
summations over weighted gradients, and the fourth term is
the weighted summation of the outer product of ∂Im∂φ with
itself.
B.3 Mutual Information
As described in Sect. 4.1, mutual information is defined over
a joint probability distribution, which is a function of the
images.
DMI =
∑
f,m
Pf,m(φ) log
(
Pf,m
Pf Pm(φ)
)
(30)
where Pf,m is the joint distribution of intensities in the fixed
and moving images, Pm =
∑
f Pf,m(φ) is the distribution
of intensities in the moving image, and Pf =
∑
m Pf,m(φ)
is the distribution of intensities in the fixed image. Because
we use the formulation of (Thévenaz and Unser 2000), the
distribution of the fixed image does not change with φ, and
its derivative is zero.
To be consistent with our discussion of minimizing a cost
function, we will work with the negative mutual informa-
tion. Negative mutual information can also be expressed as
a sum of entropies, DMI = H(Pf,m) − H(Pf ) − H(Pm).
For brevity in what follows, we will calculate the derivative
of an entropy. The derivative of MI can then be evaluated by
combining the derivatives of its entropies. The entropy of a
probability distribution where each element, Pk , is a func-
tion of the moving image, Im, which is in turn a function of
some parameters, φ, is given by,
H(φ) = −
∑
k
Pk(Im(φ)) log(Pk(Im(φ))). (31)
Its derivative is then,
∂H
∂φ
= −
∑
k
[
log(Pk)(φ)
∂Pk
∂Im
∂Im
∂φ
+ Pk
∂ log(Pk)
∂Im
∂Im
∂φ
]
,
(32)
∂H
∂φ
= −
∑
k
[
log(Pk)(φ)
∂Pk
∂Im
∂Im
∂φ
+
Pk
Pk
∂Pk
∂Im
∂Im
∂φ
]
. (33)
Since the sum of the probability distribution, P , must be 1,
the sum of the derivatives of Pk must be zero; therefore, the
second term disappears. Taking the second derivative,
∂2H(φ)
∂φ2
= −
∑
k
[
∂Im
∂φ
∂Pk
∂Im
1
Pk(φ)
∂Pk
∂Im
∂Im
∂φ
+
∂Im
∂φ
∂2Pk
∂I 2m
∂Im
∂φ
log(Pk)(φ)
where ¯If and ¯Im are the mean intensities for the fixed and
moving images, respectively. The gradient is then
∂DNCC
∂φ
=
1
v2
(
u
∂v
∂Im
− v
∂u
∂Im
)
∂Im
∂φ
.
Temporarily defining A = u ∂v∂Im , B = v
∂u
∂Im
, and C = v2 we
obtain a Hessian of
∂2DNCC
∂φ2
=
∂Im
∂φ
T (C ∂A∂Im + B
∂C
∂Im
− A ∂C∂Im − C
∂B
∂Im
C2
)
∂Im
∂φ
+
1
v2
(
u
∂v
∂Im
− v
∂u
∂Im
)
∂2I
∂φ2
.
The second term is equivalent to what is dropped in the
Gauss-Newton approximation for Mean Squares (Eq. 25).
We will also assume this term is small and drop it. We now
turn our attention to each of the parts of the first term.
Assuming that the mean values of the image, ¯Im, do not
appreciably change as the parameters change, the derivatives
of u and v with respect to a particular pixel in the moving
image, Imi , are:
∂u
∂Imi
= (Ifi − ¯If )
and
∂v
∂Imi
=
1
v
∑
k
[
(Ifk − ¯If )
2](Imi − ¯Im).
Letting F =
∑
k[(Ifk − ¯If )
2
], and noting that it is a constant,
A =
u
v
· F · (Imi − ¯Im),
and,
∂A
∂Imj
=
F
v
(Ifj − ¯If )(Imi − ¯Im)
−
uF 2
v3
(Imj − ¯Im)(Imi − ¯Im) +
uF
v
δij ,
where δij is an indicator function, equal to 1 if i = j and 0
otherwise. Continuing,
B = v · (Ifi − ¯If ),
∂B
∂Imj
=
F
v
· (Ifi − ¯If ) · (Imj − ¯Im),
and,
∂C
∂Imj
= 2F · (Imj − ¯Im).
Combining the terms and simplifying, we obtain Eqs. 27
and 28 for the gradient and Hessian of DNCC , respectively.
This Hessian is lengthy to write, but tractable to com-
pute. The first two terms of Eq. 28 can be computed from
summations over weighted gradients, and the fourth term is
the weighted summation of the outer product of ∂Im∂φ with
itself.
B.3 Mutual Information
As described in Sect. 4.1, mutual information is defined over
a joint probability distribution, which is a function of the
images.
DMI =
∑
f,m
Pf,m(φ) log
(
Pf,m
Pf Pm(φ)
)
(30)
where Pf,m is the joint distribution of intensities in the fixed
and moving images, Pm =
∑
f Pf,m(φ) is the distribution
of intensities in the moving image, and Pf =
∑
m Pf,m(φ)
is the distribution of intensities in the fixed image. Because
we use the formulation of (Thévenaz and Unser 2000), the
distribution of the fixed image does not change with φ, and
its derivative is zero.
To be consistent with our discussion of minimizing a cost
function, we will work with the negative mutual informa-
tion. Negative mutual information can also be expressed as
a sum of entropies, DMI = H(Pf,m) − H(Pf ) − H(Pm).
For brevity in what follows, we will calculate the derivative
of an entropy. The derivative of MI can then be evaluated by
combining the derivatives of its entropies. The entropy of a
probability distribution where each element, Pk , is a func-
tion of the moving image, Im, which is in turn a function of
some parameters, φ, is given by,
H(φ) = −
∑
k
Pk(Im(φ)) log(Pk(Im(φ))). (31)
Its derivative is then,
∂H
∂φ
= −
∑
k
[
log(Pk)(φ)
∂Pk
∂Im
∂Im
∂φ
+ Pk
∂ log(Pk)
∂Im
∂Im
∂φ
]
,
(32)
∂H
∂φ
= −
∑
k
[
log(Pk)(φ)
∂Pk
∂Im
∂Im
∂φ
+
Pk
Pk
∂Pk
∂Im
∂Im
∂φ
]
. (33)
Since the sum of the probability distribution, P , must be 1,
the sum of the derivatives of Pk must be zero; therefore, the
second term disappears. Taking the second derivative,
∂2H(φ)
∂φ2
= −
∑
k
[
∂Im
∂φ
∂Pk
∂Im
1
Pk(φ)
∂Pk
∂Im
∂Im
∂φ
+
∂Im
∂φ
∂2Pk
∂I 2m
∂Im
∂φ
log(Pk)(φ)
Page 22
212 Int J Comput Vis (2010) 87: 191–212
+ log(Pk)(φ)
∂Pk
∂Im
∂2Im
∂φ2
]
. (34)
The third term is the equivalent to the term dropped in the
MSD and NCC approximate Hessians, so we will make the
same assumption and drop it for MI as well.
References
Andreopoulos, A., & Tsotsos, J. K. (2005). A novel algorithm for fit-
ting 3d active appearance models: applications to cardiac mri seg-
mentation. In Proceedings of the 14th Scandinavian conference
on image analysis, Joensuu, Finland.
Anuta, P. E. (1969). Digital registration of multispectral imagery. SPIE
Journal, 7(6), 168–175.
Baker, S., & Matthews, I. (2004). Lucas-Kanade 20 years on: a uni-
fied framework. International Journal of Computer Vision, 56(3),
221–255.
Baker, S., Patil, R., Cheung, K. M., & Matthews, I. (2004). Lucas-
Kanade 20 years on: part 5 (Technical Report CMU-RI-TR-04-
64). Robotics Institute, Carnegie Mellon University, Pittsburgh,
PA.
Bartoli, A. (2006). Groupwise geometric and photometric direct image
registration. In Proceedings of the seventeenth British machine
vision conference (Vol. 1, pp. 157–66), Edinburgh, UK.
Benhimane, S., & Malis, E. (2004). Real-time image-based tracking of
planes using efficient second-order minimization. In IEEE/RSJ in-
ternational conference on intelligent robots and systems, Sendai,
Japan.
Brooks, R., & Arbel, T. (2006). Generalizing inverse compositional
image alignment. In Proceedings of the 18th international confer-
ence on pattern recognition (ICPR2006) (Vol. 2, pp. 1200–1203),
Hong Kong.
Brown, L. G. (1992). A survey of image registration techniques. ACM
Computing Surveys, 24(4), 325–376.
Brown, M., & Lowe, D. (2007). Automatic panoramic image stitch-
ing using invariant features. International Journal of Computer
Vision, 74(1), 59–73.
Byrd, R. H., Lu, P., & Nocedal, J. (1995). A limited memory algorithm
for bound constrained optimization. SIAM Journal on Scientific
and Statistical Computing, 16(5), 1190–1208.
Conn, A. R., Gould, N. I. M., & Toint, P. L. (2000). Trust-region meth-
ods. Society for Industrial and Applied Mathematics and Mathe-
matical Programming Society.
Dennis, J. E., Jr., & Wolkowicz, H. (1993). Sizing and least-change se-
cant methods. SIAM Journal of Numerical Analysis, 30(5), 1291–
1314.
Dowson, N., & Bowden, R. (2008). Mutual information for Lucas-
Kanade tracking (MILK): an inverse compositional formulation.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
30(1), 180–185.
Fletcher, R. (1987). Practical methods of optimization. New York: Wi-
ley.
Gill, P. E., Murray, W., & Wright, M. H. (1981). Practical optimiza-
tion. New York: Academic Press.
Harshman, R. A. (2001). An index formalism that generalizes the ca-
pabilities of matrix notation and algebra to n-way arrays. Journal
of Chemometrics, 15(9), 689–714.
Ibanez, L., Schroeder, W., Ng, L., & Cates, J. (2005). The ITK software
guide: ITK V2.0. Kitware Inc., Clifton Park. http://www.itk.org.
Keller, Y., & Averbuch, A. (2004). Fast motion estimation using
bi-directional gradient methods. IEEE Transactions on Image
Processing, 13(8), 1042–1054.
Kwan, R.-S., Evans, A., & Pike, G. (1999). MRI simulation-based
evaluation of image-processing and classification methods. IEEE
Transactions on Medical Imaging, 18(11), 1085–1097.
Liu, T.-L., & Chen, H.-T. (2004). Real-time tracking using trust-region
methods. IEEE Transactions on Pattern Analysis and Machine In-
telligence, 26(3), 397–402.
Lucas, B. D., & Kanade, T. (1981). An iterative image registration tech-
nique with an application to stereo vision. In Proceedings of the
international joint conference on artificial intelligence (pp. 674–
679).
Malis, E. (2004). Improving vision-based control using efficient
second-order minimization techniques. In Proceedings of the
2004 IEEE international conference on robotics and automation
(ICRA04) (pp. 1843–1848), New Orleans, LA, USA.
Malis, E., & Benhimane, S. (2005). A unified approach to visual track-
ing and servoing. Robotics and Autonomous Systems, 52(1), 39–
52.
Matthews, I., & Baker, S. (2004). Active appearance models revisited.
International Journal of Computer Vision, 60(2), 135–164.
Moler, C., & Van Loan, C. (2003). Nineteen dubious ways to compute
the exponential of a matrix, twenty five years later. SIAM Review,
45(1), 3–49.
Pluim, J. P. W., Maintz, J. B. A., & Viergever, M. A. (2003). Mutual-
information based registration of medical images: a survey. IEEE
Transactions on Medical Imaging, 22(8), 986–1004.
Sheskin, D. J. (2000). Handbook of parametric and nonparametric sta-
tistical procedures (2nd ed.). New York: Chapman and Hall/CRC.
Simard, P. Y., Cun, Y. A. L., Denker, J. S., & Victorri, B. (2000). Trans-
formation invariance in pattern recognition: tangent distance and
propagation. International Journal of Imaging Systems and Tech-
nology, 11(3), 181–197.
Szeliski, R. (2004). Image alignment and stitching: a tutorial (Techni-
cal Report MSR-TR-2004-92). Microsoft Research).
Thévenaz, P., & Unser, M. (2000). Optimization of mutual informa-
tion for multiresolution image registration. IEEE Transactions on
Image Processing, 9(12), 2083–2099.
Toews, M., Collins, D. L., & Arbel, T. (2005). Maximum a posteri-
ori local histogram estimation for image registration. In Proceed-
ings of medical image computing and computer aided intervention
(MICCAI 2005) (Vol. 2, pp. 163–170), Palm Springs, CA, USA.
van de Kraats, E., Penney, G., Tomazevic, D., van Walsum, T., &
Niessen, W. (2005). Standardized evaluation methodology for 2D-
3D registration. IEEE Transactions on Medical Imaging, 24(9),
1177–1190.
Venkataraman, P. (2001). Applied optimization with MATLAB pro-
gramming. New York: Canada.
Viola, P., & Wells, W. M., III (1995). Alignment by maximization of
mutual information. In Proceedings of the 5th IEEE international
conference on computer vision (ICCV1995) (pp. 16–23). Cam-
bridge: IEEE Computer Society.
Weidendorfer, J., Kowarschik, M., & Trinitis, C. (2004). A tool suite
for simulation based analysis of memory access behavior. In Pro-
ceedings of the 4th international conference on computational sci-
ence (ICCS 2004), Krakow, Poland.
Zitova, B., & Flusser, J. (2003). Image registration methods: a survey.
Image and Vision Computing, 21(11), 977–1000.
+ log(Pk)(φ)
∂Pk
∂Im
∂2Im
∂φ2
]
. (34)
The third term is the equivalent to the term dropped in the
MSD and NCC approximate Hessians, so we will make the
same assumption and drop it for MI as well.
References
Andreopoulos, A., & Tsotsos, J. K. (2005). A novel algorithm for fit-
ting 3d active appearance models: applications to cardiac mri seg-
mentation. In Proceedings of the 14th Scandinavian conference
on image analysis, Joensuu, Finland.
Anuta, P. E. (1969). Digital registration of multispectral imagery. SPIE
Journal, 7(6), 168–175.
Baker, S., & Matthews, I. (2004). Lucas-Kanade 20 years on: a uni-
fied framework. International Journal of Computer Vision, 56(3),
221–255.
Baker, S., Patil, R., Cheung, K. M., & Matthews, I. (2004). Lucas-
Kanade 20 years on: part 5 (Technical Report CMU-RI-TR-04-
64). Robotics Institute, Carnegie Mellon University, Pittsburgh,
PA.
Bartoli, A. (2006). Groupwise geometric and photometric direct image
registration. In Proceedings of the seventeenth British machine
vision conference (Vol. 1, pp. 157–66), Edinburgh, UK.
Benhimane, S., & Malis, E. (2004). Real-time image-based tracking of
planes using efficient second-order minimization. In IEEE/RSJ in-
ternational conference on intelligent robots and systems, Sendai,
Japan.
Brooks, R., & Arbel, T. (2006). Generalizing inverse compositional
image alignment. In Proceedings of the 18th international confer-
ence on pattern recognition (ICPR2006) (Vol. 2, pp. 1200–1203),
Hong Kong.
Brown, L. G. (1992). A survey of image registration techniques. ACM
Computing Surveys, 24(4), 325–376.
Brown, M., & Lowe, D. (2007). Automatic panoramic image stitch-
ing using invariant features. International Journal of Computer
Vision, 74(1), 59–73.
Byrd, R. H., Lu, P., & Nocedal, J. (1995). A limited memory algorithm
for bound constrained optimization. SIAM Journal on Scientific
and Statistical Computing, 16(5), 1190–1208.
Conn, A. R., Gould, N. I. M., & Toint, P. L. (2000). Trust-region meth-
ods. Society for Industrial and Applied Mathematics and Mathe-
matical Programming Society.
Dennis, J. E., Jr., & Wolkowicz, H. (1993). Sizing and least-change se-
cant methods. SIAM Journal of Numerical Analysis, 30(5), 1291–
1314.
Dowson, N., & Bowden, R. (2008). Mutual information for Lucas-
Kanade tracking (MILK): an inverse compositional formulation.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
30(1), 180–185.
Fletcher, R. (1987). Practical methods of optimization. New York: Wi-
ley.
Gill, P. E., Murray, W., & Wright, M. H. (1981). Practical optimiza-
tion. New York: Academic Press.
Harshman, R. A. (2001). An index formalism that generalizes the ca-
pabilities of matrix notation and algebra to n-way arrays. Journal
of Chemometrics, 15(9), 689–714.
Ibanez, L., Schroeder, W., Ng, L., & Cates, J. (2005). The ITK software
guide: ITK V2.0. Kitware Inc., Clifton Park. http://www.itk.org.
Keller, Y., & Averbuch, A. (2004). Fast motion estimation using
bi-directional gradient methods. IEEE Transactions on Image
Processing, 13(8), 1042–1054.
Kwan, R.-S., Evans, A., & Pike, G. (1999). MRI simulation-based
evaluation of image-processing and classification methods. IEEE
Transactions on Medical Imaging, 18(11), 1085–1097.
Liu, T.-L., & Chen, H.-T. (2004). Real-time tracking using trust-region
methods. IEEE Transactions on Pattern Analysis and Machine In-
telligence, 26(3), 397–402.
Lucas, B. D., & Kanade, T. (1981). An iterative image registration tech-
nique with an application to stereo vision. In Proceedings of the
international joint conference on artificial intelligence (pp. 674–
679).
Malis, E. (2004). Improving vision-based control using efficient
second-order minimization techniques. In Proceedings of the
2004 IEEE international conference on robotics and automation
(ICRA04) (pp. 1843–1848), New Orleans, LA, USA.
Malis, E., & Benhimane, S. (2005). A unified approach to visual track-
ing and servoing. Robotics and Autonomous Systems, 52(1), 39–
52.
Matthews, I., & Baker, S. (2004). Active appearance models revisited.
International Journal of Computer Vision, 60(2), 135–164.
Moler, C., & Van Loan, C. (2003). Nineteen dubious ways to compute
the exponential of a matrix, twenty five years later. SIAM Review,
45(1), 3–49.
Pluim, J. P. W., Maintz, J. B. A., & Viergever, M. A. (2003). Mutual-
information based registration of medical images: a survey. IEEE
Transactions on Medical Imaging, 22(8), 986–1004.
Sheskin, D. J. (2000). Handbook of parametric and nonparametric sta-
tistical procedures (2nd ed.). New York: Chapman and Hall/CRC.
Simard, P. Y., Cun, Y. A. L., Denker, J. S., & Victorri, B. (2000). Trans-
formation invariance in pattern recognition: tangent distance and
propagation. International Journal of Imaging Systems and Tech-
nology, 11(3), 181–197.
Szeliski, R. (2004). Image alignment and stitching: a tutorial (Techni-
cal Report MSR-TR-2004-92). Microsoft Research).
Thévenaz, P., & Unser, M. (2000). Optimization of mutual informa-
tion for multiresolution image registration. IEEE Transactions on
Image Processing, 9(12), 2083–2099.
Toews, M., Collins, D. L., & Arbel, T. (2005). Maximum a posteri-
ori local histogram estimation for image registration. In Proceed-
ings of medical image computing and computer aided intervention
(MICCAI 2005) (Vol. 2, pp. 163–170), Palm Springs, CA, USA.
van de Kraats, E., Penney, G., Tomazevic, D., van Walsum, T., &
Niessen, W. (2005). Standardized evaluation methodology for 2D-
3D registration. IEEE Transactions on Medical Imaging, 24(9),
1177–1190.
Venkataraman, P. (2001). Applied optimization with MATLAB pro-
gramming. New York: Canada.
Viola, P., & Wells, W. M., III (1995). Alignment by maximization of
mutual information. In Proceedings of the 5th IEEE international
conference on computer vision (ICCV1995) (pp. 16–23). Cam-
bridge: IEEE Computer Society.
Weidendorfer, J., Kowarschik, M., & Trinitis, C. (2004). A tool suite
for simulation based analysis of memory access behavior. In Pro-
ceedings of the 4th international conference on computational sci-
ence (ICCS 2004), Krakow, Poland.
Zitova, B., & Flusser, J. (2003). Image registration methods: a survey.
Image and Vision Computing, 21(11), 977–1000.
Sign up today - FREE
Mendeley saves you time finding and organizing research. Learn more
- All your research in one place
- Add and import papers easily
- Access it anywhere, anytime
Start using Mendeley in seconds!
Readership Statistics
8 Readers on Mendeley
by Discipline
13% Engineering
by Academic Status
38% Researcher (at a non-Academic Institution)
38% Ph.D. Student
13% Student (Bachelor)
by Country
25% United States
25% France
13% China


