Sign up & Download
Sign in

An Introduction to the Conjugate Gradient Method Without the Agonizing Pain

by Jonathan Richard Shewchuk
Science (1994)

Abstract

The Conjugate Gradient Method is the most prominent iterative method for solving sparse systems of linear equations. Unfortunately, many textbook treatments of the topic are written with neither illustrations nor intuition, and their victims can be found to this day babbling senselessly in the corners of dusty libraries. For this reason, a deep, geometric understanding of the method has been reserved for the elite brilliant few who have painstakingly decoded the mumblings of their forebears. Nevertheless, the Conjugate Gradient Method is a composite of simple, elegant ideas that almost anyone can understand. Of course, a reader as intelligent as yourself will learn them almost effortlessly. The idea of quadratic forms is introduced and used to derive the methods of Steepest Descent, Conjugate Directions, and Conjugate Gradients. Eigenvectors are explained and used to examine the convergence of the Jacobi Method, Steepest Descent, and Conjugate Gradients. Other topics include preconditioning and the nonlinear Conjugate Gradient Method. I have taken pains to make this article easy to read. Sixty-six illustrations are provided. Dense prose is avoided. Concepts are explained in several differentways. Most equations are coupled with an intuitive interpretation.

Cite this document (BETA)

Available from citeseerx.ist.psu.edu
Page 1
hidden

An Introduction to the Conjugate Gradient Method Without the Agonizing Pain

An Introduction to
the Conjugate Gradient Method
Without the Agonizing Pain
Edition 1 14
Jonathan Richard Shewchuk
August 4, 1994
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
Abstract
The Conjugate Gradient Method is the most prominent iterative method for solving sparse systems of linear equations.
Unfortunately, many textbook treatments of the topic are written with neither illustrations nor intuition, and their
victims can be found to this day babbling senselessly in the corners of dusty libraries. For this reason, a deep,
geometric understanding of the method has been reserved for the elite brilliant few who have painstakingly decoded
the mumblings of their forebears. Nevertheless, the Conjugate Gradient Method is a composite of simple, elegant ideas
that almost anyone can understand. Of course, a reader as intelligent as yourself will learn them almost effortlessly.
The idea of quadratic forms is introduced and used to derive the methods of Steepest Descent, Conjugate Directions,
and Conjugate Gradients. Eigenvectors are explained and used to examine the convergence of the Jacobi Method,
Steepest Descent, and Conjugate Gradients. Other topics include preconditioning and the nonlinear Conjugate Gradient
Method. I have taken pains to make this article easy to read. Sixty-six illustrations are provided. Dense prose is
avoided. Concepts are explained in several different ways. Most equations are coupled with an intuitive interpretation.
Supported in part by the Natural Sciences and Engineering Research Council of Canada under a 1967 Science and Engineering
Scholarship and by the National Science Foundation under Grant ASC-9318163. The views and conclusions contained in this
document are those of the author and should not be interpreted as representing the official policies, either express or implied, of
NSERC, NSF, or the U.S. Government.
Page 2
hidden
Keywords: conjugate gradient method, preconditioning, convergence analysis, agonizing pain
Page 3
hidden
Contents
1. Introduction 1
2. Notation 1
3. The Quadratic Form 2
4. The Method of Steepest Descent 6
5. Thinking with Eigenvectors and Eigenvalues 9
5.1. Eigen do it if I try  9
5.2. Jacobi iterations  11
5.3. A Concrete Example  12
6. Convergence Analysis of Steepest Descent 13
6.1. Instant Results  13
6.2. General Convergence  17
7. The Method of Conjugate Directions 21
7.1. Conjugacy  21
7.2. Gram-Schmidt Conjugation  25
7.3. Optimality of the Error Term  26
8. The Method of Conjugate Gradients 30
9. Convergence Analysis of Conjugate Gradients 32
9.1. Picking Perfect Polynomials  33
9.2. Chebyshev Polynomials  35
10. Complexity 37
11. Starting and Stopping 38
11.1. Starting  38
11.2. Stopping  38
12. Preconditioning 39
13. Conjugate Gradients on the Normal Equations 41
14. The Nonlinear Conjugate Gradient Method 42
14.1. Outline of the Nonlinear Conjugate Gradient Method  42
14.2. General Line Search  43
14.3. Preconditioning  47
A Notes 48
B Canned Algorithms 49
B1. Steepest Descent  49
B2. Conjugate Gradients  50
B3. Preconditioned Conjugate Gradients  51
i
Page 4
hidden
B4. Nonlinear Conjugate Gradients with Newton-Raphson and Fletcher-Reeves  52
B5. Preconditioned Nonlinear Conjugate Gradients with Secant and Polak-Ribie`re  53
C Ugly Proofs 54
C1. The Solution to


Minimizes the Quadratic Form  54
C2. A Symmetric Matrix Has Orthogonal Eigenvectors.  54
C3. Optimality of Chebyshev Polynomials  55
D Homework 56
ii
Page 5
hidden
About this Article
An electronic copy of this article is available by anonymous FTP to WARP.CS.CMU.EDU (IP address
128.2.209.103) under the filename quake-papers/painless-conjugate-gradient.ps. A
PostScript file containing full-page copies of the figures herein, suitable for transparencies, is available
as quake-papers/painless-conjugate-gradient-pics.ps. Most of the illustrations were
created using Mathematica.
c

1994 by Jonathan Richard Shewchuk. This article may be freely duplicated and distributed so long as
no consideration is received in return, and this copyright notice remains intact.
This guide was created to help students learn Conjugate Gradient Methods as easily as possible. Please
mail me (jrs@cs.cmu.edu) comments, corrections, and any intuitions I might have missed; some of
these will be incorporated into a second edition. I am particularly interested in hearing about use of this
guide for classroom teaching.
For those who wish to learn more about iterative methods, I recommend William L. Briggs’ “A Multigrid
Tutorial” [2], one of the best-written mathematical books I have read.
Special thanks to Omar Ghattas, who taught me much of what I know about numerical methods, and
provided me with extensive comments on the first draft of this article. Thanks also to James Epperson,
David O’Hallaron, James Stichnoth, Nick Trefethen, and Daniel Tunkelang for their comments.
To help you skip chapters, here’s a dependence graph of the sections:
1 Introduction
2 Notation
10 Complexity
13 Normal Equations
3 Quadratic Form
4 Steepest Descent 5 Eigenvectors
6 SD Convergence7 Conjugate Directions
8 Conjugate Gradients 9 CG Convergence
11 Stop & Start 12 Preconditioning 14 Nonlinear CG
This article is dedicated to every mathematician who uses figures as abundantly as I have herein.
iii
Page 7
hidden
1. Introduction
When I decided to learn the Conjugate Gradient Method (henceforth, CG), I read four different descriptions,
which I shall politely not identify. I understood none of them. Most of them simply wrote down the method,
then proved its properties without any intuitive explanation or hint of how anybody might have invented CG
in the first place. This article was born of my frustration, with the wish that future students of CG will learn
a rich and elegant algorithm, rather than a confusing mass of equations.
CG is the most popular iterative method for solving large systems of linear equations. CG is effective
for systems of the form

 (1)
where

is an unknown vector,

is a known vector, and

is a known, square, symmetric, positive-definite
(or positive-indefinite) matrix. (Don’t worry if you’ve forgotten what “positive-definite” means; we shall
review it.) These systems arise in many important settings, such as finite difference and finite element
methods for solving partial differential equations, structural analysis, circuit analysis, and math homework.
Iterative methods like CG are suited for use with sparse matrices. If

is dense, your best course of
action is probably to factor

and solve the equation by backsubstitution. The time spent factoring a dense

is roughly equivalent to the time spent solving the system iteratively; and once

is factored, the system
can be backsolved quickly for multiple values of

. Compare this dense matrix with a sparse matrix of
larger size that fills the same amount of memory. The triangular factors of a sparse

usually have many
more nonzero elements than

itself. Factoring may be impossible due to limited memory, and will be
time-consuming as well; even the backsolving step may be slower than iterative solution. On the other
hand, most iterative methods are memory-efficient and run quickly with sparse matrices.
I assume that you have taken a first course in linear algebra, and that you have a solid understanding
of matrix multiplication and linear independence, although you probably don’t remember what those
eigenthingies were all about. From this foundation, I shall build the edifice of CG as clearly as I can.
2. Notation
Let us begin with a few definitions and notes on notation.
With a few exceptions, I shall use capital letters to denote matrices, lower case letters to denote vectors,
and Greek letters to denote scalars.

is an  matrix, and

and

are vectors — that is,  1 matrices.
Written out fully, Equation 1 is






11

12 

1 

21

22

2 
.
.
.
.
.
.
.
.
.

 1

 2 


ff










1

2
.
.
.


ff











1

2
.
.
.


ff





The inner product of two vectors is written
ffifl 
, and represents the scalar sum ! "$# 1

"

"
. Note that
ffifl %&fl 
. If

and

are orthogonal, then
ffifl 
0. In general, expressions that reduce to 1  1 matrices,
such as
ffifl 
and
'fl(
, are treated as scalar values.
Page 8
hidden
2 Jonathan Richard Shewchuk
-4 -2 2 4 6
-6
-4
-2
2
4

1

2
3

1 ) 2

2

2
2

1 ) 6

2
+*
8
Figure 1: Sample two-dimensional linear system. The solution lies at the intersection of the lines.
A matrix

is positive-definite if, for every nonzero vector  ,


-,
0 (2)
This may mean little to you, but don’t feel bad; it’s not a very intuitive idea, and it’s hard to imagine how
a matrix that is positive-definite might look differently from one that isn’t. We will get a feeling for what
positive-definiteness is about when we see how it affects the shape of quadratic forms.
Finally, don’t forget the important basic identities .
0/12fl34/5fl 6fl
and .
0/7198 1 4/:8 1 8 1
.
3. The Quadratic Form
A quadratic form is simply a scalar, quadratic function of a vector with the form
;
.
<1= 1
2


>*?


)A@
(3)
where

is a matrix,

and

are vectors, and
@
is a scalar constant. I shall show shortly that if

is symmetric
and positive-definite,
;
.
<1
is minimized by the solution to


.
Throughout this paper, I will demonstrate ideas with the simple sample problem
4CB 3 2
2 6 DE
FCB 2
*
8 DGE @

0 (4)
The system

HI
is illustrated in Figure 1. In general, the solution

lies at the intersection point
of hyperplanes, each having dimension
*
1. For this problem, the solution is
%KJ
2
E
*
2 L

. The
corresponding quadratic form
;
.
M1
appears in Figure 2. A contour plot of
;
.
<1
is illustrated in Figure 3.
Page 9
hidden
The Quadratic Form 3
-4
-2
0
2
4
6
-6
-4
-2
0
2
4
0
50
100
150

1

2
;
.
<1

Figure 2: Graph of a quadratic form N<OQPSR . The minimum point of this surface is the solution to TUPWVYX .
-4 -2 2 4 6
-6
-4
-2
2
4

1

2
Figure 3: Contours of the quadratic form. Each ellipsoidal curve has constant N<O$PZR .
Page 10
hidden
4 Jonathan Richard Shewchuk
-4 -2 2 4 6
-8
-6
-4
-2
2
4
6

1

2
Figure 4: Gradient NZ[\O$PZR of the quadratic form. For every P , the gradient points in the direction of steepest
increase of N<O$PZR , and is orthogonal to the contour lines.
Because

is positive-definite, the surface defined by
;
.
M1
is shaped like a paraboloid bowl. (I’ll have more
to say about this in a moment.)
The gradient of a quadratic form is defined to be
;^]
.
M1=






_
_`
1
;
.
<1
_
_`
2
;
.
<1
.
.
.
_
_a`ab
;
.
M1






(5)
The gradient is a vector field that, for a given point

, points in the direction of greatest increase of
;
.
<1
.
Figure 4 illustrates the gradient vectors for Equation 3 with the constants given in Equation 4. At the bottom
of the paraboloid bowl, the gradient is zero. One can minimize
;
.
<1
by setting
;
]
.
M1
equal to zero.
With a little bit of tedious math, one can apply Equation 5 to Equation 3, and derive
;
]
.
M1c 1
2



)
1
2
:*?
(6)
If

is symmetric, this equation reduces to
;
]
.
M1c4
d*?
(7)
Setting the gradient to zero, we obtain Equation 1, the linear system we wish to solve. Therefore, the
solution to

%e
is a critical point of
;
.
M1
. If

is positive-definite as well as symmetric, then this
Page 11
hidden
The Quadratic Form 5
(c)

1

2
;
.
<1

(d)

1

2
;
.
<1

(a)

1

2
;
.
<1

(b)

1

2
;
.
<1

Figure 5: (a) Quadratic form for a positive-definite matrix. (b) For a negative-definite matrix. (c) For a
singular (and positive-indefinite) matrix. A line that runs through the bottom of the valley is the set of
solutions. (d) For an indefinite matrix. Because the solution is a saddle point, Steepest Descent and CG
will not work. In three dimensions or higher, a singular matrix can also have a saddle.
solution is a minimum of
;
.
M1
, so
fH
can be solved by finding an

that minimizes
;
.
<1
. (If  is not
symmetric, then Equation 6 hints that CG will find a solution to the system 12 .
6fl
)
G1gYh
. Note that
1
2 .
6fl
)
G1
is symmetric.)
Why do symmetric positive-definite matrices have this nice property? Consider the relationship between
;
at some arbitrary point i and at the solution point
W

8 1
. From Equation 3 one can show (Appendix C1)
that if

is symmetric (be it positive-definite or not),
;
.ji
1=
;
.
M1
)
1
2
.ji
*Y<1


.ji
*kM1
(8)
If

is positive-definite as well, then by Inequality 2, the latter term is positive for all iYl
4
. It follows that

is a global minimum of
;
.
The fact that
;
.
M1
is a paraboloid is our best intuition of what it means for a matrix to be positive-definite.
If

is not positive-definite, there are several other possibilities.

could be negative-definite — the result
of negating a positive-definite matrix (see Figure 2, but hold it upside-down).  might be singular, in which
case no solution is unique; the set of solutions is a line or hyperplane having a uniform value for
;
. If

is none of the above, then

is a saddle point, and techniques like Steepest Descent and CG will likely fail.
Figure 5 demonstrates the possibilities. The values of

and
@
determine where the minimum point of the
paraboloid lies, but do not affect the paraboloid’s shape.
Why go to the trouble of converting the linear system into a tougher-looking problem? The methods
under study — Steepest Descent and CG — were developed and are intuitively understood in terms of
minimization problems like Figure 2, not in terms of intersecting hyperplanes such as Figure 1.
Page 13
hidden
The Method of Steepest Descent 7
0.2 0.4 0.6
20
40
60
80
100
120
140
-4 -2 2 4 6
-6
-4
-2
2
4
-4 -2 2 4 6
-6
-4
-2
2
4
-2.5
0 2.5
5
-5
-2.5
0
2.5 0
50
100
150
y
(c); .  m " n )Ay r m " n 1

1
(d)

2

m
1 n

1
(a)

2
 m
0 n
p
m
0 n

(b)

1

2
;
.
M1

Figure 6: The method of Steepest Descent. (a) Starting at ~j 2 € 2 ‚„ƒ , take a step in the direction of steepest
descent of N . (b) Find the point on the intersection of these two surfaces that minimizes N . (c) This parabola
is the intersection of surfaces. The bottommost point is our target. (d) The gradient at the bottommost point
is orthogonal to the gradient of the previous step.
-2 -1 1 2 3
-3
-2
-1
1
2

1

2
Figure 7: The gradient NZ[ is shown at several locations along the search line (solid arrows). Each gradient’s
projection onto the line is also shown (dotted arrows). The gradient vectors represent the direction of
steepest increase of N , and the projections represent the rate of increase as one traverses the search line.
On the search line, N is minimized where the gradient is orthogonal to the search line.
Page 14
hidden
8 Jonathan Richard Shewchuk
-4 -2 2 4 6
-6
-4
-2
2
4

1

2
 m
0 n

Figure 8: Here, the method of Steepest Descent starts at ~j 2 € 2 ‚„ƒ and converges at ~ 2 €9 2 ‚„ƒ .
Putting it all together, the method of Steepest Descent is:
r
m
"
n
 U*?(m
"
n
E
(10)
y
m
"
n

r

m
"
n
r
m
"
n
r

m
"
n

r
m
"
n
E
(11)
(m
"$…
1 n
%(m
"
n
)Ay
m
"
n
r
m
"
n
(12)
The example is run until it converges in Figure 8. Note the zigzag path, which appears because each
gradient is orthogonal to the previous gradient.
The algorithm, as written above, requires two matrix-vector multiplications per iteration. The computa-
tional cost of Steepest Descent is dominated by matrix-vector products; fortunately, one can be eliminated.
By premultiplying both sides of Equation 12 by
*
and adding

, we have
r
m
"$…
1 n

r
m
"
n
*
y
m
"
n

r
m
"
n
(13)
Although Equation 10 is still needed to compute r
m
0 n , Equation 13 can be used for every iteration thereafter.
The product

r , which occurs in both Equations 11 and 13, need only be computed once. The disadvantage
of using this recurrence is that the sequence defined by Equation 13 is generated without any feedback from
the value of

m
"
n
, so that accumulation of floating point roundoff error may cause

m
"
n
to converge to some
point near

. This effect can be avoided by periodically using Equation 10 to recompute the correct residual.
Before analyzing the convergence of Steepest Descent, I must digress to ensure that you have a solid
understanding of eigenvectors.
Page 16
hidden
10 Jonathan Richard Shewchuk
If
/
is symmetric (and often if it is not), then there exists a set of linearly independent eigenvectors
of
/
, denoted † 1
E
† 2
E

E
†

. This set is not unique, because each eigenvector can be scaled by an arbitrary
nonzero constant. Each eigenvector has a corresponding eigenvalue, denoted ‡ 1
E
‡ 2
E

E
‡

. These are
uniquely defined for a given matrix. The eigenvalues may or may not be equal to each other; for instance,
the eigenvalues of the identity matrix ” are all one, and every nonzero vector is an eigenvector of ” .
What if
/
is applied to a vector that is not an eigenvector? A very important skill in understanding
linear algebra — the skill this section is written to teach — is to think of a vector as a sum of other vectors
whose behavior is understood. Consider that the set of eigenvectors •a† "—– forms a basis for ˜  (because a
symmetric
/
has eigenvectors that are linearly independent). Any -dimensional vector can be expressed
as a linear combination of these eigenvectors, and because matrix-vector multiplication is distributive, one
can examine the effect of
/
on each eigenvector separately.
In Figure 11, a vector

is illustrated as a sum of two eigenvectors † 1 and † 2. Applying
/
to

is
equivalent to applying
/
to the eigenvectors, and summing the result. On repeated application, we have
/
"
™4/
"
† 1 )
/
"
† 2

‡
"
1 † 1 ) ‡
"
2 † 2. If the magnitudes of all the eigenvalues are smaller than one,
/
"

will
converge to zero, because the eigenvectors that compose

converge to zero when
/
is repeatedly applied.
If one of the eigenvalues has magnitude greater than one,

will diverge to infinity. This is why numerical
analysts attach importance to the spectral radius of a matrix:
š
.
/1=
max ˆ‰‡
"
ˆ
E
‡
" is an eigenvalue of
/
.
If we want

to converge to zero quickly, š .
/1
should be less than one, and preferably as small as possible.
B x
B x2


x
Bxv v
1 2

Figure 11: The vector P (solid arrow) can be expressed as a linear combination of eigenvectors (dashed
arrows), whose associated eigenvalues are › 1 V 0 ‘ 7 and › 2 VA 2. The effect of repeatedly applying  to P
is best understood by examining the effect of  on each eigenvector. When  is repeatedly applied, one
eigenvector converges to zero while the other diverges; hence, “\P also diverges.
It’s important to mention that there is a family of nonsymmetric matrices that do not have a full set of
independent eigenvectors. These matrices are known as defective, a name that betrays the well-deserved
hostility they have earned from frustrated linear algebraists. The details are too complicated to describe in
this article, but the behavior of defective matrices can be analyzed in terms of generalized eigenvectors and
generalized eigenvalues. The rule that
/
"

converges to zero if and only if all the generalized eigenvalues
have magnitude smaller than one still holds, but is harder to prove.
Here’s a useful fact: the eigenvalues of a positive-definite matrix are all positive. This fact can be proven
from the definition of eigenvalue:
/
†

‡œ†
†

/
†

‡œ†

†

By the definition of positive-definite, the †
fl(/
† is positive (for nonzero † ). Hence, ‡ must be positive also.
Page 17
hidden
Thinking with Eigenvectors and Eigenvalues 11
5.2. Jacobi iterations
Of course, a procedure that always converges to zero isn’t going to help you attract friends. Consider a more
useful procedure: the Jacobi Method for solving

3u
. The matrix

is split into two parts:  , whose
diagonal elements are identical to those of

, and whose off-diagonal elements are zero; and ž , whose
diagonal elements are zero, and whose off-diagonal elements are identical to those of

. Thus,



)
ž .
We derive the Jacobi Method:

 

  *
ž

)

  *

8 1
ž

)

8 1
  /W
)AŸ
E
where
/ +*

8 1
ž
E
Ÿ


8 1
(14)
Because  is diagonal, it is easy to invert. This identity can be converted into an iterative method by
forming the recurrence
 m
"„…
1 n

/5(m
"
n
)¡Ÿ
(15)
Given a starting vector

m
0 n , this formula generates a sequence of vectors. Our hope is that each successive
vector will be closer to the solution

than the last.

is called a stationary point of Equation 15, because if
 m
"
n
%
, then
 m
"„…
1 n will also equal

.
Now, this derivation may seem quite arbitrary to you, and you’re right. We could have formed any
number of identities for

instead of Equation 14. In fact, simply by splitting

differently — that is,
by choosing a different  and ž — we could have derived the Gauss-Seidel method, or the method of
Successive Over-Relaxation (SOR). Our hope is that we have chosen a splitting for which / has a small
spectral radius. Here, I chose the Jacobi splitting arbitrarily for simplicity.
Suppose we start with some arbitrary vector

m
0 n . For each iteration, we apply
/
to this vector, then add
Ÿ
to the result. What does each iteration do?
Again, apply the principle of thinking of a vector as a sum of other, well-understood vectors. Express
each iterate

m
"
n
as the sum of the exact solution

and the error term p
m
"
n
. Then, Equation 15 becomes

m
"„…
1 n
 /W
m
"
n
)AŸ
 /
.

)
p
m
"
n
1
)¡Ÿ
 /W
)AŸ
)
/
p
m
"
n
 
)
/
p
m
"
n
(by Equation 14)
E
¢
p
m
"„…
1 n
 /
p
m
"
n
(16)
Each iteration does not affect the “correct part” of
 m
"
n
(because  is a stationary point); but each iteration
does affect the error term. It is apparent from Equation 16 that if š .
/1
‹ 1, then the error term p
m
"
n
will
converge to zero as Œ approaches infinity. Hence, the initial vector
 m
0 n has no effect on the inevitable
outcome!
Of course, the choice of
(m
0 n does affect the number of iterations required to converge to

within
a given tolerance. However, its effect is less important than that of the spectral radius š .
/1
, which
determines the speed of convergence. Suppose that †a£ is the eigenvector of
/
with the largest eigenvalue
(so that š . /1¤ ‡¥£ ). If the initial error p m 0 n , expressed as a linear combination of eigenvectors, includes a
component in the direction of †£ , this component will be the slowest to converge.
Page 18
hidden
12 Jonathan Richard Shewchuk
-4 -2 2 4 6
-6
-4
-2
2
4
7
2

1

2
Figure 12: The eigenvectors of T are directed along the axes of the paraboloid defined by the quadratic
form N<OQPSR . Each eigenvector is labeled with its associated eigenvalue. Each eigenvalue is proportional to
the steepness of the corresponding slope.
/
is not generally symmetric (even if  is), and may even be defective. However, the rate of convergence
of the Jacobi Method depends largely on š .
/71
,which depends on

. Unfortunately, Jacobi does not converge
for every

, or even for every positive-definite

.
5.3. A Concrete Example
To demonstrate these ideas, I shall solve the system specified by Equation 4. First, we need a method of
finding eigenvalues and eigenvectors. By definition, for any eigenvector † with eigenvalue ‡ ,

†

‡ffi†

‡'”¦†
.§‡'”
*YG1
†

0
Eigenvectors are nonzero, so ‡ffi”
*Y
must be singular. Then,
det .§‡'”
*?G1c
0
The determinant of ‡ffi”
*¨
is called the characteristic polynomial. It is a degree polynomial in ‡
whose roots are the set of eigenvalues. The characteristic polynomial of
 (from Equation 4) is
det
B
‡
*
3
*
2
*
2 ‡
*
6 D

‡
2 * 9 ‡
)
14

.§‡
*
7
1
.§‡
*
2
1
E
and the eigenvalues are 7 and 2. To find the eigenvector associated with ‡

7,
.§‡ffi”
*YG1
†
CB 4
*
2
*
2 1 D
B
† 1
† 2 D

0
¢ 4 † 1
*
2 † 2

0
Page 20
hidden
14 Jonathan Richard Shewchuk
-4 -2 2 4 6
-6
-4
-2
2
4

1
(a)

2
*
0 470 47
-4 -2 2 4 6
-6
-4
-2
2
4

1
(b)

2
 m
0 n

-4 -2 2 4 6
-6
-4
2
4

1
(c)

2
p
m
0 n
-4 -2 2 4 6
-6
-4
-2
2
4

1
(d)

2
p
m
1 n
-4 -2 2 4 6
-6
-4
-2
2
4

1
(e)

2
p
m
2 n
-4 -2 2 4 6
-6
-4
-2
2
4

1
(f)

2
Figure 13: Convergence of the Jacobi Method. (a) The eigenvectors of  are shown with their corresponding
eigenvalues. Unlike the eigenvectors of T , these eigenvectors are not the axes of the paraboloid. (b) The
Jacobi Method starts at ~ff 2 €9 2 ‚¬ƒ and converges at ~ 2 € 2 ‚„ƒ . (c, d, e) The error vectors ­¯® 0 ° , ­¯® 1 ° , ­¯® 2 ° (solid
arrows) and their eigenvector components (dashed arrows). (f) Arrowheads represent the eigenvector
components of the first four error vectors. Each eigenvector component of the error is converging to zero at
the expected rate based on its eigenvalue.
Page 21
hidden
Convergence Analysis of Steepest Descent 15
-4 -2 2 4 6
-6
-4
-2
2
4

1

2
Figure 14: Steepest Descent converges to the exact solution on the first iteration if the error term is an
eigenvector.
symmetric, there exists a set of orthogonal eigenvectors of

. As we can scale eigenvectors arbitrarily,
let us choose so that each eigenvector is of unit length. This choice gives us the useful property that
†

£
†²±
´³ 1
E µ


E0
E µ
l



(17)
Express the error term as a linear combination of eigenvectors
p
m
"
n


·
£
#
1 ¸
£†a£
E
(18)
where
¸
£ is the length of each component of p
m
"
n
. From Equations 17 and 18 we have the following identities:
r
m
"
n
 *
p
m
"
n
+*
·
£
¸
£¯‡¥£a†a£
E
(19)
¹
p
m
"
n
¹ 2 
p

m
"
n
p
m
"
n

·
£
¸
2
£
E
(20)
p

m
"
n

p
m
"
n

.
·
£
¸
£†

£
1
.
·
£
¸
£‡¥£º†a£
1

·
£
¸
2
£
ണ
E
(21)
¹
r
m
"
n
¹ 2 
r

m
"
n
r
m
"
n

·
£
¸
2
£
‡
2
£
E
(22)
r

m
"
n

r
m
"
n

·
£
¸
2
£
‡
3
£
(23)
Page 22
hidden
16 Jonathan Richard Shewchuk
-4 -2 2 4 6
-4
-2
2
4
6

1

2
Figure 15: Steepest Descent converges to the exact solution on the first iteration if the eigenvalues are all
equal.
Equation 19 shows that r
m
"
n
too can be expressed as a sum of eigenvector components, and the length of
these components are
*
¸
£¯‡&£ . Equations 20 and 22 are just Pythagoras’ Law.
Now we can proceed with the analysis. Equation 12 gives
p
m
"$…
1 n

p
m
"
n
)
r

m
"
n
r
m
"
n
r

m
"
n

r
m
"
n
r
m
"
n

p
m
"
n
)
!
£
¸
2
£
‡
2
£
!
£
¸
2
£
‡
3
£
r
m
"
n
(24)
We saw in the last example that, if p
m
"
n
has only one eigenvector component, then convergence is
achieved in one step by choosing
y
m
"
n

‡
8 1
«
. Now let’s examine the case where p
m
"
n
is arbitrary, but all the
eigenvectors have a common eigenvalue ‡ . Equation 24 becomes
p
m
"$…
1 n

p
m
"
n
)
‡
2
!
£
¸
2
£
‡
3
!
£
¸
2
£
.
*
‡'p
m
"
n
1

0
Figure 15 demonstrates why, once again, there is instant convergence. Because all the eigenvalues are
equal, the ellipsoid is spherical; hence, no matter what point we start at, the residual must point to the center
of the sphere. As before, choose
y
m
"
n

‡
8 1
.
Page 24
hidden
18 Jonathan Richard Shewchuk
minimizing
;
.
 m
"
n
1
. With this norm, we have
¹
p
m
"$…
1 n
¹ 2» 
p

m
"$…
1 n

p
m
"$…
1 n

.¼p

m
"
n
)ty
m
"
n
r

m
"
n
1—
.¼p
m
"
n
)Ay
m
"
n
r
m
"
n
1 (by Equation 12)

p

m
"
n

p
m
"
n
)
2
y
m
"
n
r

m
"
n

p
m
"
n
)Ay
2m
"
n
r

m
"
n

r
m
"
n
(by symmetry of  )

¹
p
m
"
n
¹ 2»
)
2
r

m
"
n
r
m
"
n
r

m
"
n

r
m
"
n
¾
*
r

m
"
n
r
m
"
ng¿
)wÀ
r

m
"
n
r
m
"
n
r

m
"
n

r
m
"
nÁ
2
r

m
"
n

r
m
"
n

¹
p
m
"
n
¹ 2» * .Âr

m
"
n
r
m
"
n
1 2
r

m
"
n

r
m
"
n

¹
p
m
"
n
¹ 2»
À
1
*
.Âr

m
"
n
r
m
"
n
1 2
.Âr

m
"
n

r
m
"
n
1
.¼p

m
"
n

p
m
"
n
1
Á

¹
p
m
"
n
¹ 2»
À
1
*
.
!
£
¸
2
£
‡
2
£
1 2
.
!
£
¸
2
£
‡
3
£
1
.
!
£
¸
2
£
ണ
1
Á
(by Identities 21, 22, 23)

¹
p
m
"
n
¹ 2»(Ã 2
E
à 2  1
*
.
!
£
¸
2
£
‡
2
£
1 2
.
!
£
¸
2
£
‡
3
£
1
.
!
£
¸
2
£
ണ
1
(25)
The analysis depends on finding an upper bound for à . To demonstrate how the weights and eigenvalues
affect convergence, I shall derive a result for

2. Assume that ‡ 1 Ä ‡ 2. The spectral condition number
of

is defined to be Å

‡ 1 ª²‡ 2 Ä 1. The slope of p
m
"
n
(relative to the coordinate system defined by the
eigenvectors), which depends on the starting point, is denoted Æ 
¸
2 ª
¸
1. We have
à 2  1
*
.
¸
2
1 ‡
2
1 )
¸
2
2 ‡
2
2
1 2
.
¸
2
1 ‡ 1 )
¸
2
2 ‡ 2
1
.
¸
2
1 ‡
3
1 )
¸
2
2 ‡
3
2
1

1
*
.¼Å
2
)
Æ
2 1 2
.¼Å
)
Æ
2 1
.¼Å
3
)
Æ
2 1 (26)
The value of à , which determines the rate of convergence of Steepest Descent, is graphed as a function
of Æ and Å in Figure 17. The graph confirms my two examples. If p
m
0 n is an eigenvector, then the slope Æ
is zero (or infinite); we see from the graph that à is zero, so convergence is instant. If the eigenvalues are
equal, then the condition number Šis one; again, we see that à is zero.
Figure 18 illustrates examples from near each of the four corners of Figure 17. These quadratic forms
are graphed in the coordinate system defined by their eigenvectors. Figures 18(a) and 18(b) are examples
with a large condition number. Steepest Descent can converge quickly if a fortunate starting point is chosen
(Figure 18(a)), but is usually at its worst when Å is large (Figure 18(b)). The latter figure gives us our best
intuition for why a large condition number can be bad:
;
.
<1
forms a trough, and Steepest Descent bounces
back and forth between the sides of the trough while making little progress along its length. In Figures 18(c)
and 18(d), the condition number is small, so the quadratic form is nearly spherical, and convergence is quick
regardless of the starting point.
Holding Å constant (because  is fixed), a little basic calculus reveals that Equation 26 is maximized
when Æ
vÇ
Å . In Figure 17, one can see a faint ridge defined by this line. Figure 19 plots worst-case
starting points for our sample matrix

. These starting points fall on the lines defined by
¸
2 ª
¸
1
ÈÇ
Å . An
Page 25
hidden
Convergence Analysis of Steepest Descent 19
0
5
10
15
20
1
20
40
60
80
100
0
0.2
0.4
0.6
0.8
Æ
Å
Ã
Æ
Figure 17: Convergence É of Steepest Descent as a function of Ê (the slope of ­¯®
“
°
) and Ë (the condition
number of T ). Convergence is fast when Ê or Ë are small. For a fixed matrix, convergence is worst when
ÊV?Ì¤Ë .
-4 -2 2 4
-4
-2
2
4
6
-4 -2 2 4
-4
-2
2
4
6
-4 -2 2 4
-4
-2
2
4
6
-4 -2 2 4
-4
-2
2
4
6
† 1
(c)† 2
† 1
(d)† 2
† 1
(a)† 2
† 1
(b)† 2
Figure 18: These four examples represent points near the corresponding four corners of the graph in
Figure 17. (a) Large Ë , small Ê . (b) An example of poor convergence. Ë and Ê are both large. (c) Small Ë
and Ê . (d) Small Ë , large Ê .
Page 26
hidden
20 Jonathan Richard Shewchuk
-4 -2 2 4 6
-6
-4
-2
2
4

1

2
 m
0 n
Figure 19: Solid lines represent the starting points that give the worst convergence for Steepest Descent.
Dashed lines represent steps toward convergence. If the first iteration starts from a worst-case point, so do
all succeeding iterations. Each step taken intersects the paraboloid axes (gray arrows) at precisely a 45 Í
angle. Here, Ë5V 3 ‘ 5.
upper bound for à (corresponding to the worst-case starting points) is found by setting Æ 2  Å 2:
à 2 Π1
* 4 Å 4
Å
5
)
2 Å 4
)
Å
3

Å
5 * 2 Å 4
)
Å
3
Å
5
)
2 Å 4
)
Å
3

.¼Å
*
1
1 2
.¼Å
)
1
1 2
Ã
Î
Å
*
1
Å
)
1
(27)
Inequality 27 is plotted in Figure 20. The more ill-conditioned the matrix (that is, the larger its condition
number Å ), the slower the convergence of Steepest Descent. In Section 9.2, it is proven that Equation 27 is
also valid for
,
2, if the condition number of a symmetric, positive-definite matrix is defined to be
Å

‡ffiÏoÐ
`
ª²‡œÏ
"

E
the ratio of the largest to smallest eigenvalue. The convergence results for Steepest Descent are
¹
p
m
"
n
¹9»
ÎvÑ
Å
*
1
Å
)
1 Ò
"
¹
p
m
0 n
¹9»
E
and (28)
Page 28
hidden
22 Jonathan Richard Shewchuk
-4 -2 2 4 6
-6
-4
-2
2
4

1

2
(m
0 n

 m
1 n
p
m
1 n
Ô
m
0 n
Figure 21: The Method of Orthogonal Directions. Unfortunately, this method only works if you already know
the answer.
the direction of Ô
m
"
n
again. Using this condition, we have
Ô

m
"
n
p
m
"$…
1 n

0
Ô

m
"
n
.¼p
m
"
n
)¡y
m
"
n
Ô
m
"
n
1Õ
0 (by Equation 29)
y
m
"
n
 *
Ô

m
"
n
p
m
"
n
Ô

m
"
n
Ô
m
"
n
(30)
Unfortunately, we haven’t accomplished anything, because we can’t compute
y
m
"
n
without knowing p
m
"
n
;
and if we knew p
m
"
n
, the problem would already be solved.
The solution is to make the search directions

-orthogonal instead of orthogonal. Two vectors Ô
m
"
n
and
Ô
m
£
n
are

-orthogonal, or conjugate, if
Ô

m
"
n

Ô
m
£
n

0
Figure 22(a) shows what  -orthogonal vectors look like. Imagine if this article were printed on bubble
gum, and you grabbed Figure 22(a) by the ends and stretched it until the ellipses appeared circular. The
vectors would then appear orthogonal, as in Figure 22(b).
Our new requirement is that p
m
"$…
1 n be

-orthogonal to Ô
m
"
n
(see Figure 23(a)). Not coincidentally, this
orthogonality condition is equivalent to finding the minimum point along the search direction Ô
m
"
n
, as in
Page 30
hidden
24 Jonathan Richard Shewchuk
-4 -2 2 4 6
-6
-4
-2
2
4

1

2
 m
0 n

 m
1 n p
m
1 n
Ô
m
0 n
(a)
-4 -2 2 4 6
-6
-4
-2
2
4

1

2
p
m
0 n
(b)
Figure 23: The method of Conjugate Directions converges in Ö steps. (a) The first step is taken along some
direction ×¥® 0 ° . The minimum point P'® 1 ° is chosen by the constraint that ­Ø® 1 ° must be T -orthogonal to צ® 0 ° . (b)
The initial error ­Ø® 0 ° can be expressed as a sum of T -orthogonal components (gray arrows). Each step of
Conjugate Directions eliminates one of these components.
To prove that this procedure really does compute

in steps, express the error term as a linear
combination of search directions; namely,
p
m
0 n


8 1
·
£
#
0 Ù
£¯Ô
m
£
n
(33)
The values of
Ù
£ can be found by a mathematical trick. Because the search directions are

-orthogonal, it
is possible to eliminate all the
Ù
£ values but one from Expression 33 by premultiplying the expression by
Ô

m
±
n

:
Ô

m
±
n

p
m
0 n

·
£
Ù
m
£
n
Ô

m
±
n

Ô
m
£
n
Ô

m
±
n

p
m
0 n

Ù
m
±
n
Ô

m
±
n

Ô
m
±
n
(by  -orthogonality of Ô vectors)
Ù
m
±
n

Ô

m
±
n

p
m
0 n
Ô

m
±
n

Ô
m
±
n

Ô

m
±
n

.¼p
m
0 n ) !
±
8 1
"$#
0 y
m
"
n
Ô
m
"
n
1
Ô

m
±
n

Ô
m
±
n
(by  -orthogonality of Ô vectors)

Ô

m
±
n

p
m
±
n
Ô

m
±
n

Ô
m
±
n
(By Equation 29). (34)
By Equations 31 and 34, we find that
y
m
"
n
Ú*
Ù
m
"
n
. This fact gives us a new way to look at the error
term. As the following equation shows, the process of building up

component by component can also be
Page 31
hidden
The Method of Conjugate Directions 25
viewed as a process of cutting down the error term component by component (see Figure 23(b)).
p
m
"
n

p
m
0 n )
"
8 1
·
£
#
0
y
m
£
n
Ô
m
£
n


8 1
·
£
#
0
Ù
m
£
n
Ô
m
£
n
*
"
8 1
·
£
#
0
Ù
m
£
n
Ô
m
£
n


8 1
·
£
#^"
Ù
m
£
n
Ô
m
£
n
(35)
After iterations, every component is cut away, and p
m
²n

0; the proof is complete.
7.2. Gram-Schmidt Conjugation
All that is needed now is a set of

-orthogonal search directions •¯Ô
m
"
n
–
. Fortunately, there is a simple way
to generate them, called a conjugate Gram-Schmidt process.
Suppose we have a set of linearly independent vectors Û 0
E
Û 1
E

E
Û

8 1. The coordinate axes will
do in a pinch, although more intelligent choices are possible. To construct Ô
m
"
n
, take Û " and subtract out
any components that are not

-orthogonal to the previous Ô vectors (see Figure 24). In other words, set
Ô
m
0 n

Û 0, and for Œ
,
0, set
Ô
m
"
n

Û
"
)
"
8 1
·
±
#
0 Ü
"
±ØÔ
m
±
n
E
(36)
where the
Ü
"
± are defined for Œ
,ݶ
. To find their values, use the same trick used to find
Ù
£ :
Ô

m
"
n

Ô
m
£
n

Û

"

Ô
m
£
n
)
"
8 1
·
±
#
0
Ü
"
±ØÔ

m
±
n

Ô
m
£
n
0

Û

"

Ô
m
£
n
)
Ü
"
£¯Ô

m
£
n

Ô
m
£
n
E
Œ
,
µ
(by  -orthogonality of Ô vectors)
Ü
"
£
 *
Û

"

Ô
m
£
n
Ô

m
£
n

Ô
m
£
n
(37)
The difficulty with using Gram-Schmidt conjugation in the method of Conjugate Directions is that all
the old search vectors must be kept in memory to construct each new one, and furthermore Þ. 3
1
operations


u
u
u
u
+
*
dß0à
1
(0)á (0)á
(1)á
Figure 24: Gram-Schmidt conjugation of two vectors. Begin with two linearly independent vectors â 0 and
â 1. Set ×¥® 0 ° Vkâ 0. The vector â 1 is composed of two components: âffiã , which is T -orthogonal (or conjugate)
to ×¥® 0 ° , and âSä , which is parallel to צ® 0 ° . After conjugation, only the T -orthogonal portion remains, and
×¥® 1 ° Vkâœã .
Page 32
hidden
26 Jonathan Richard Shewchuk
-4 -2 2 4 6
-6
-4
-2
2
4

1

2
Figure 25: The method of Conjugate Directions using the axial unit vectors, also known as Gaussian
elimination.
are required to generate the full set. In fact, if the search vectors are constructed by conjugation of the axial
unit vectors, Conjugate Directions becomes equivalent to performing Gaussian elimination (see Figure 25).
As a result, the method of Conjugate Directions enjoyed little use until the discovery of CG — which is a
method of Conjugate Directions — cured these disadvantages.
An important key to understanding the method of Conjugate Directions (and also CG) is to notice
that Figure 25 is just a stretched copy of Figure 21! Remember that when one is performing the method
of Conjugate Directions (including CG), one is simultaneously performing the method of Orthogonal
Directions in a stretched (scaled) space.
7.3. Optimality of the Error Term
Conjugate Directions has an interesting property: it finds at every step the best solution within the bounds
of where it’s been allowed to explore. Where has it been allowed to explore? Let å " be the Œ -dimensional
subspace span •¯Ô
m
0 n
E
Ô
m
1 n
E

E
Ô
m
"
8 1 n
– ; the value p
m
"
n
is chosen from p
m
0 n ) å
"
. What do I mean by “best
solution”? I mean that Conjugate Directions chooses the value from p m 0 n ) å " that minimizes
¹
p
m
"
n
¹9» (see
Figure 26). In fact, some authors derive CG by trying to minimize ¹ p m "
n
¹9»
within p
m
0 n ) å
"
.
In the same way that the error term can be expressed as a linear combination of search directions
Page 33
hidden
The Method of Conjugate Directions 27

dß (0)á
(1)á
e (2)
á
e(0)á (1)á
e

Figure 26: In this figure, the shaded area is ­Ø® 0 °&ç:è 2 Vk­¯® 0 °Sç span éê×¥® 0 ° €g×¥® 1 °—ë . The ellipsoid is a contour on
which the energy norm is constant. After two steps, Conjugate Directions finds ­¯® 2 ° , the point on ­Ø® 0 °œçè 2
that minimizes ì|­¦ì}í .
(Equation 35), its energy norm can be expressed as a summation.
¹
p
m
"
n
¹9»


8 1
·
£
#^"

8 1
·
±
#^"
Ù
m
£
n
Ù
m
±
n
Ô

m
£
n

Ô
m
±
n
(by Equation 35)


8 1
·
£
#^"
Ù
2m
£
n
Ô

m
£
n

Ô
m
£
n
(by  -orthogonality of Ô vectors).
Each term in this summation is associated with a search direction that has not yet been traversed. Any other
vector p chosen from p
m
0 n ) å
"
must have these same terms in its expansion, which proves that p
m
"
n
must
have the minimum energy norm.
Having proven the optimality with equations, let’s turn to the intuition. Perhaps the best way to visualize
the workings of Conjugate Directions is by comparing the space we are working in with a “stretched” space,
as in Figure 22. Figures 27(a) and 27(c) demonstrate the behavior of Conjugate Directions in ˜ 2 and ˜ 3 ;
lines that appear perpendicular in these illustrations are orthogonal. On the other hand, Figures 27(b) and
27(d) show the same drawings in spaces that are stretched (along the eigenvector axes) so that the ellipsoidal
contour lines become spherical. Lines that appear perpendicular in these illustrations are

-orthogonal.
In Figure 27(a), the Method of Conjugate Directions begins at  m 0 n , takes a step in the direction of Ô
m
0 n ,
and stops at the point
(m
1 n , where the error vector p
m
1 n is

-orthogonal to Ô
m
0 n . Why should we expect
this to be the minimum point on

m
0 n ) å 1? The answer is found in Figure 27(b): in this stretched space,
p
m
1 n appears perpendicular to Ô
m
0 n because they are

-orthogonal. The error vector p
m
1 n is a radius of the
concentric circles that represent contours on which
¹
p
¹9»
is constant, so

m
0 n ) å 1 must be tangent at

m
1 n
to the circle that
 m
1 n lies on. Hence,
(m
1 n is the point on
(m
0 n ) å 1 that minimizes
¹
p
m
1 n
¹9»
.
This is not surprising; we have already seen in Section 7.1 that

-conjugacy of the search direction and
the error term is equivalent to minimizing
; (and therefore ¹ p ¹9» ) along the search direction. However,
after Conjugate Directions takes a second step, minimizing ¹ p ¹9» along a second search direction Ô m 1 n , why
should we expect that
¹
p
¹9»
will still be minimized in the direction of Ô
m
0 n ? After taking Œ steps, why will
;
.

m
"
n
1
be minimized over all of

m
0 n ) å
" ?
Page 34
hidden
28 Jonathan Richard Shewchuk
(m
1 n

r
m
1 n
Ô
m
0 n
Ô
m
1 n
p
m
1 n
 m
0 n
 m
0 n ) å 1
(a)

r
m
1 n
 m
1 n
 m
0 n
Ô
m
0 n
Ô
m
1 n
p
m
1 n
 m
0 n ) å 1
(b)


m
1 n
 m
2 n
Ô
m
0 n
(m
0 n ) å 2

m
0 n
r
m
1 n
Ô
m
1 n

m
0 n ) å 1
(c)

(m
2 n
Ô
m
1 n
 m
1 n
Ô
m
0 n
(m
0 n
r
m
1 n (m
0 n ) å 2

m
0 n ) å 1
(d)
Figure 27: Optimality of the Method of Conjugate Directions. (a) A two-dimensional problem. Lines that
appear perpendicular are orthogonal. (b) The same problem in a “stretched” space. Lines that appear
perpendicular are T -orthogonal. (c) A three-dimensional problem. Two concentric ellipsoids are shown; P
is at the center of both. The line P'® 0 °îçfè 1 is tangent to the outer ellipsoid at Pffi® 1 ° . The plane P'® 0 °îçfè 2 is
tangent to the inner ellipsoid at Pffi® 2 ° . (d) Stretched view of the three-dimensional problem.
Page 35
hidden
The Method of Conjugate Directions 29
In Figure 27(b), Ô m 0 n and Ô
m
1 n appear perpendicular because they are

-orthogonal. It is clear that Ô
m
1 n
must point to the solution

, because Ô
m
0 n is tangent at
 m
1 n to a circle whose center is

. However, a
three-dimensional example is more revealing. Figures 27(c) and 27(d) each show two concentric ellipsoids.
The point
 m
1 n lies on the outer ellipsoid, and
 m
2 n lies on the inner ellipsoid. Look carefully at these figures:
the plane
 m
0 n ) å 2 slices through the larger ellipsoid, and is tangent to the smaller ellipsoid at
 m
2 n . The
point

is at the center of the ellipsoids, underneath the plane.
Looking at Figure 27(c), we can rephrase our question. Suppose you and I are standing at (m 1 n , and
want to walk to the point that minimizes
¹
p
¹
on
 m
0 n ) å 2; but we are constrained to walk along the search
direction Ô
m
1 n . If Ô
m
1 n points to the minimum point, we will succeed. Is there any reason to expect that Ô
m
1 n
will point the right way?
Figure 27(d) supplies an answer. Because Ô m 1 n is

-orthogonal to Ô
m
0 n , they are perpendicular in this
diagram. Now, suppose you were staring down at the plane
 m
0 n ) å 2 as if it were a sheet of paper; the
sight you’d see would be identical to Figure 27(b). The point  m 2 n would be at the center of the paper,
and the point

would lie underneath the paper, directly under the point
 m
2 n . Because Ô
m
0 n and Ô
m
1 n are
perpendicular, Ô
m
1 n points directly to
 m
2 n , which is the point in
 m
0 n ) å 2 closest to

. The plane
 m
0 n ) å 2
is tangent to the sphere on which

m
2 n lies. If you took a third step, it would be straight down from

m
2 n to

, in a direction

-orthogonal to å 2.
Another way to understand what is happening in Figure 27(d) is to imagine yourself standing at the
solution point

, pulling a string connected to a bead that is constrained to lie in

m
0 n ) å
"
. Each time the
expanding subspace å is enlarged by a dimension, the bead becomes free to move a little closer to you.
Now if you stretch the space so it looks like Figure 27(c), you have the Method of Conjugate Directions.
Another important property of Conjugate Directions is visible in these illustrations. We have seen that,
at each step, the hyperplane

m
0 n ) å
" is tangent to the ellipsoid on which

m
"
n
lies. Recall from Section 4
that the residual at any point is orthogonal to the ellipsoidal surface at that point. It follows that r
m
"
n
is
orthogonal to å " as well. To show this fact mathematically, premultiply Equation 35 by
*
Ô

m
"
n

:
*
Ô

m
"
n

p
m
£
n
 *

8 1
·
£
#^"
Ù
m
£
n
Ô

m
"
n

Ô
m
£
n
(38)
Ô

m
"
n
r
m
£
n

0
E
Œo‹
µ
(by  -orthogonality of Ô -vectors). (39)
We could have derived this identity by another tack. Recall that once we take a step in a search direction,
we need never step in that direction again; the error term is evermore

-orthogonal to all the old search
directions. Because r
m
"
n
u*
p
m
"
n
, the residual is evermore orthogonal to all the old search directions.
Because the search directions are constructed from the Û vectors, the subspace spanned by Û 0
E

E
Û
"
8 1
is å " , and the residual r
m
"
n
is orthogonal to these previous Û vectors as well (see Figure 28). This is proven
by taking the inner product of Equation 36 and r
m
£
n
:
Ô

m
"
n
r
m
£
n

Û

"
r
m
£
n
)
"
8 1
·
±
#
0
Ü
"
±ïÔ

m
±
n
r
m
£
n
(40)
0

Û

"
r
m
£
n
E
ŒU‹
µ
(by Equation 39) (41)
There is one more identity we will use later. From Equation 40 (and Figure 28),
Ô

m
"
n
r
m
"
n

Û

"
r
m
"
n
(42)
Page 36
hidden
30 Jonathan Richard Shewchuk



r
(0)á
(1)á
(2)á (2)
á
u
u1 0à
u2
e
(2)á
Figure 28: Because the search directions ×¥® 0 ° €g×¥® 1 ° are constructed from the vectors â 0 €2â 1, they span the
same subspace
è
2 (the gray-colored plane). The error term ­Ø® 2 ° is T -orthogonal to è 2, the residual 𯮠2 ° is
orthogonal to
è
2, and a new search direction צ® 2 ° is constructed (from â 2) to be T -orthogonal to è 2. The
endpoints of â 2 and ×¥® 2 ° lie on a plane parallel to è 2, because ×¥® 2 ° is constructed from â 2 by Gram-Schmidt
conjugation.
To conclude this section, I note that as with the method of Steepest Descent, the number of matrix-vector
products per iteration can be reduced to one by using a recurrence to find the residual:
r
m
"$…
1 n
 *

p
m
"$…
1 n
 *

.¼p
m
"
n
)¡y
m
"
n
Ô
m
"
n
1

r
m
"
n
*
y
m
"
n

Ô
m
"
n
(43)
8. The Method of Conjugate Gradients
It may seem odd that an article about CG doesn’t describe CG until page 30, but all the machinery is now in
place. In fact, CG is simply the method of Conjugate Directions where the search directions are constructed
by conjugation of the residuals (that is, by setting Û "  r m "
n
).
This choice makes sense for many reasons. First, the residuals worked for Steepest Descent, so why not
for Conjugate Directions? Second, the residual has the nice property that it’s orthogonal to the previous
search directions (Equation 39), so it’s guaranteed always to produce a new, linearly independent search
direction unless the residual is zero, in which case the problem is already solved. As we shall see, there is
an even better reason to choose the residual.
Let’s consider the implications of this choice. Because the search vectors are built from the residuals, the
subspace span •ar
m
0 n
E
r
m
1 n
E

E
r
m
"
8 1 n
– is equal to å " . As each residual is orthogonal to the previous search
directions, it is also orthogonal to the previous residuals (see Figure 29); Equation 41 becomes
r

m
"
n
r
m
£
n

0
E
Œ
l

µ
(44)
Interestingly, Equation 43 shows that each new residual r
m
"
n
is just a linear combination of the previous
residual and

Ô
m
"
8 1 n . Recalling that Ô
m
"
8 1 nFñ å
"
, this fact implies that each new subspace å "„… 1 is formed
from the union of the previous subspace å " and the subspace

å
"
. Hence,
å
"

span •¯Ô
m
0 n
E

Ô
m
0 n
E
 2
Ô
m
0 n
E

E

"
8 1
Ô
m
0 n
–

span •ar
m
0 n
E

r
m
0 n
E
 2
r
m
0 n
E

E

"
8 1
r
m
0 n
–

Page 40
hidden
34 Jonathan Richard Shewchuk
2 7
-1
-0.75
-0.5
-0.25
0.25
0.5
0.75
1
2 7
-1
-0.75
-0.5
-0.25
0.25
0.5
0.75
1
2 7
-1
-0.75
-0.5
-0.25
0.25
0.5
0.75
1
2 7
-1
-0.75
-0.5
-0.25
0.25
0.5
0.75
1
‡
(c) 2 .§‡ 1
‡
(d) 2 .§‡ 1
‡
(a) 0 .§‡ 1
‡
(b) 1 .§‡ 1
Figure 31: The convergence of CG after ’ iterations depends on how close a polynomial 
“
of degree ’ can
be to zero on each eigenvalue, given the constraint that 
“
O 0 R V 1.
CG finds the polynomial that minimizes this expression, but convergence is only as good as the convergence
of the worst eigenvector. Letting Λ .
1
be the set of eigenvalues of

, we have
¹
p
m
"
n
¹ 2» Î min
ø
max


Λ
m
»
n
J

"
.§‡
1
L
2 ·
£
¸
2
£
‡&£

min

ø
max


Λ
m
»
n
J

"
.§‡
1
L
2 ¹
p
m
0 n
¹ 2»
(50)
Figure 31 illustrates, for several values of Œ , the  " that minimizes this expression for our sample problem
with eigenvalues 2 and 7. There is only one polynomial of degree zero that satisfies  0 . 0
1
1, and that
is  0 .§‡
1d
1, graphed in Figure 31(a). The optimal polynomial of degree one is  1 .§‡
1
1
*
2

ª 9,
graphed in Figure 31(b). Note that  1 . 2
16
5 ª 9 and  1 . 7
1F´*
5 ª 9, and so the energy norm of the error
term after one iteration of CG is no greater than 5 ª 9 its initial value. Figure 31(c) shows that, after two
iterations, Equation 50 evaluates to zero. This is because a polynomial of degree two can be fit to three
points (  2 . 0
1s
1,  2 . 2
1¤
0, and  2 . 7
1¤
0). In general, a polynomial of degree can fit
)
1 points,
and thereby accommodate separate eigenvalues.
The foregoing discussing reinforces our understanding that CG yields the exact result after iterations;
and furthermore proves that CG is quicker if there are duplicated eigenvalues. Given infinite floating point
precision, the number of iterations required to compute an exact solution is at most the number of distinct
eigenvalues. (There is one other possibility for early termination:  m 0 n may already be

-orthogonal to
some of the eigenvectors of

. If eigenvectors are missing from the expansion of

m
0 n , their eigenvalues
may be omitted from consideration in Equation 50. Be forewarned, however, that these eigenvectors may
be reintroduced by floating point roundoff error.)
Page 41
hidden
Convergence Analysis of Conjugate Gradients 35
-1 -0.5 0.5 1
-2
-1.5
-1
-0.5
0.5
1
1.5
2
-1 -0.5 0.5 1
-2
-1.5
-1
-0.5
0.5
1
1.5
2
-1 -0.5 0.5 1
-2
-1.5
-1
-0.5
0.5
1
1.5
2
-1 -0.5 0.5 1
-2
-1.5
-1
-0.5
0.5
1
1.5
2
Ã

10 .
Ã
1
Ã

49 .
Ã
1
Ã

2 .
Ã
1
Ã

5 .
Ã
1
Figure 32: Chebyshev polynomials of degree 2, 5, 10, and 49.
We also find that CG converges more quickly when eigenvalues are clustered together (Figure 31(d))
than when they are irregularly distributed between ‡ffiÏ "

and ‡ffiÏoÐ ` , because it is easier for CG to choose a
polynomial that makes Equation 50 small.
If we know something about the characteristics of the eigenvalues of

, it is sometimes possible to
suggest a polynomial that leads to a proof of a fast convergence. For the remainder of this analysis, however,
I shall assume the most general case: the eigenvalues are evenly distributed between ‡ffiÏ "

and ‡œÏsÐ ` , the
number of distinct eigenvalues is large, and floating point roundoff occurs.
9.2. Chebyshev Polynomials
A useful approach is to minimize Equation 50 over the range
J
‡œÏ
"

E
‡ffiÏoÐ
`
L rather than at a finite number of
points. The polynomials that accomplish this are based on Chebyshev polynomials.
The Chebyshev polynomial of degree Œ is

"
.
Ã
1c 1
2 
.
Ã
)
à 2 * 1
1
"
)
.
Ã
*

à 2 * 1
1
"

(If this expression doesn’t look like a polynomial to you, try working it out for Œ equal to 1 or 2.) Several
Chebyshev polynomials are illustrated in Figure 32. The Chebyshev polynomials have the property that
ˆ

"
.
Ã
1
ˆ
Î 1 (in fact, they oscillate between 1 and * 1) on the domain Ã
ñ
Jx*
1
E
1 L , and furthermore that
ˆ

"
.
Ã
1
ˆ is maximum on the domain à l
ñ
Jx*
1
E
1 L among all such polynomials. Loosely speaking, ˆ
" . Ã
1
ˆ
increases as quickly as possible outside the boxes in the illustration.
Page 42
hidden
36 Jonathan Richard Shewchuk
1 2 3 4 5 6 7 8
-1
-0.75
-0.5
-0.25
0.25
0.5
0.75
1
‡
 2 .§‡
1
Figure 33: The polynomial  2 O\›SR that minimizes Equation 50 for ›
“
V 2 and ›V 7 in the general case.
This curve is a scaled version of the Chebyshev polynomial of degree 2. The energy norm of the error term
after two iterations is no greater than 0.183 times its initial value. Compare with Figure 31(c), where it is
known that there are only two eigenvalues.
It is shown in Appendix C3 that Equation 50 is minimized by choosing

"
.§‡
1c

"
¾

fiffffifl
…



ø
b
8 2



fiffffifl
8



ø
b
¿

"
¾


ffffifl
…



ø
b


ffffifl
8



ø
b
¿

This polynomial has the oscillating properties of Chebyshev polynomials on the domain ‡ffiÏ "

Î
‡
Î
‡ffiÏoÐ
`
(see Figure 33). The denominator enforces our requirement that  " . 0 1c 1. The numerator has a maximum
value of one on the interval between ‡ffiÏ "

and ‡ffiÏoÐ ` , so from Equation 50 we have
¹
p
m
"
n
¹
»
Î

"
Ñ
‡ffiÏoÐ
`
)
‡œÏ
"

‡ffiÏoÐ
`
*
‡œÏ
"

Ò
8 1
¹
p
m
0 n
¹
»


"
Ñ
Å
)
1
Å
*
1 Ò
8 1
¹
p
m
0 n
¹9»

2


À
©
Å
)
1
©
Å
*
1
Á
"
)wÀ
©
Å
*
1
©
Å
)
1
Á
"


8 1
¹
p
m
0 n
¹9»
(51)
The second addend inside the square brackets converges to zero as Œ grows, so it is more common to express
the convergence of CG with the weaker inequality
¹
p
m
"
n
¹9»
Î 2
À
©
Å
*
1
©
Å
)
1
Á
"
¹
p
m
0 n
¹9»
(52)
Page 43
hidden
Complexity 37
20 40 60 80 100
0
0.2
0.4
0.6
0.8
1
Å
Ã
Figure 34: Convergence of Conjugate Gradients (per iteration) as a function of condition number. Compare
with Figure 20.
The first step of CG is identical to a step of Steepest Descent. Setting Œ

1 in Equation 51, we obtain
Equation 28, the convergence result for Steepest Descent. This is just the linear polynomial case illustrated
in Figure 31(b).
Figure 34 charts the convergence per iteration of CG, ignoring the lost factor of 2. In practice, CG
usually converges faster than Equation 52 would suggest, because of good eigenvalue distributions or good
starting points. Comparing Equations 52 and 28, it is clear that the convergence of CG is much quicker
than that of Steepest Descent (see Figure 35). However, it is not necessarily true that every iteration of CG
enjoys faster convergence; for example, the first iteration of CG is an iteration of Steepest Descent. The
factor of 2 in Equation 52 allows CG a little slack for these poor iterations.
10. Complexity
The dominating operations during an iteration of either Steepest Descent or CG are matrix-vector products.
In general, matrix-vector multiplication requires Þ.Âþ
1
operations, where þ is the number of non-zero
entries in the matrix. For many problems, including those listed in the introduction,

is sparse and
þ
ñ
Þ.Â
1
.
Suppose we wish to perform enough iterations to reduce the norm of the error by a factor of ; that is,
¹
p
m
"
n
¹
Î

¹
p
m
0 n
¹
. Equation 28 can be used to show that the maximum number of iterations required to
achieve this bound using Steepest Descent is
Œ
Î"!
1
2
Å ln Ñ 1#
Ò$
E
whereas Equation 52 suggests that the maximum number of iterations CG requires is
Œ
Î%!
1
2
©
Å ln Ñ 2#
Ò$

Page 44
hidden
38 Jonathan Richard Shewchuk
200 400 600 800 1000
0
5
10
15
20
25
30
35
40
Å
Ã
Figure 35: Number of iterations of Steepest Descent required to match one iteration of CG.
I conclude that Steepest Descent has a time complexity of Þ.ÂþfÅ
1
, whereas CG has a time complexity of
Þ.Âþ
©
Å
1
. Both algorithms have a space complexity of Þ.Âþ
1
.
Finite difference and finite element approximations of second-order elliptic boundary value problems
posed on Ô -dimensional domains often have Å
ñ
Þ.Â
2 ½
z
1
. Thus, Steepest Descent has a time complexity of
Þ.Â
2 1 for two-dimensional problems, versus Þ. 3 ½ 2
1
for CG; and Steepest Descent has a time complexity
of Þ. 5 ½ 3
1
for three-dimensional problems, versus Þ. 4 ½ 3
1
for CG.
11. Starting and Stopping
In the preceding presentation of the Steepest Descent and Conjugate Gradient algorithms, several details
have been omitted; particularly, how to choose a starting point, and when to stop.
11.1. Starting
There’s not much to say about starting. If you have a rough estimate of the value of

, use it as the starting
value

m
0 n . If not, set

m
0 n

0; either Steepest Descent or CG will eventually converge when used to solve
linear systems. Nonlinear minimization (coming up in Section 14) is trickier, though, because there may
be several local minima, and the choice of starting point will determine which minimum the procedure
converges to, or whether it will converge at all.
11.2. Stopping
When Steepest Descent or CG reaches the minimum point, the residual becomes zero, and if Equation 11
or 48 is evaluated an iteration later, a division by zero will result. It seems, then, that one must stop
Page 45
hidden
Preconditioning 39
immediately when the residual is zero. To complicate things, though, accumulated roundoff error in the
recursive formulation of the residual (Equation 47) may yield a false zero residual; this problem could be
resolved by restarting with Equation 45.
Usually, however, one wishes to stop before convergence is complete. Because the error term is not
available, it is customary to stop when the norm of the residual falls below a specified value; often, this
value is some small fraction of the initial residual ( ¹ r m "
n
¹
‹&
¹
r
m
0 n
¹ ). See Appendix B for sample code.
12. Preconditioning
Preconditioning is a technique for improving the condition number of a matrix. Suppose that ' is a
symmetric, positive-definite matrix that approximates

, but is easier to invert. We can solve

4
indirectly by solving
'
8 1 ™
'
8 1
(53)
If Å(.('
8 1 G1*)
Å(.
1
, or if the eigenvalues of '
8 1  are better clustered than those of

, we can iteratively
solve Equation 53 more quickly than the original problem. The catch is that '
8 1  is not generally
symmetric nor definite, even if ' and

are.
We can circumvent this difficulty, because for every symmetric, positive-definite ' there is a (not
necessarily unique) matrix ž that has the property that žž flu ' . (Such an ž can be obtained, for
instance, by Cholesky factorization.) The matrices ' 8 1  and ž 8 1  ž 8œfl have the same eigenvalues. This
is true because if † is an eigenvector of '
8 1  with eigenvalue ‡ , then ž

† is an eigenvector of ž
8 1 
ž
8œfl
with eigenvalue ‡ :
. ž
8 1 
ž
8œfl
1
. ž

†
1U
. ž

ž
8œfl
1
ž
8 1 
†

ž

'
8 1 
†

‡'ž

†

The system



can be transformed into the problem
ž
8 1 
ž
8œfl+
™
ž
8 1
E
+

ž


E
which we solve first for
+

, then for

. Because ž
8 1 
ž
8œfl
is symmetric and positive-definite,
+

can be
found by Steepest Descent or CG. The process of using CG to solve this system is called the Transformed
Preconditioned Conjugate Gradient Method:
+
Ô
m
0 n

+
r
m
0 n

ž
8 1 s*
ž
8 1 
ž
8œfl+
 m
0 n
E
y
m
"
n

+
r

m
"
n
+
r
m
"
n
+
Ô

m
"
n
ž
8 1 
ž
8œfl
+
Ô
m
"
n
E
+
 m
"$…
1 n

+
 m
"
n
)Ay
m
"
n
+
Ô
m
"
n
E
+
r
m
"$…
1 n

+
r
m
"
n
*
y
m
"
n
ž
8 1 
ž
8œfl
+
Ô
m
"
n
E
Ü
m
"„…
1 n

+
r

m
"$…
1 n
+
r
m
"$…
1 n
+
r

m
"
n
+
r
m
"
n
E
+
Ô
m
"$…
1 n

+
r
m
"$…
1 n )
Ü
m
"$…
1 n
+
Ô
m
"
n

Page 46
hidden
40 Jonathan Richard Shewchuk
-4 -2 2 4 6
-8
-6
-4
-2

1

2
Figure 36: Contour lines of the quadratic form of the diagonally preconditioned sample problem.
This method has the undesirable characteristic that ž must be computed. However, a few careful
variable substitutions can eliminate ž . Setting
+
r
m
"
n

ž
8 1
r
m
"
n
and
+
Ô
m
"
n

ž

Ô
m
"
n
, and using the identities
+

m
"
n

ž
fl 
m
"
n
and ž
8œfl
ž
8 1 
'
8 1
, we derive the Untransformed Preconditioned Conjugate Gradient
Method:
r
m
0 n

o*Y
 m
0 n
E
Ô
m
0 n

'
8 1
r
m
0 n
E
y
m
"
n

r

m
"
n
'
8 1
r
m
"
n
Ô

m
"
n

Ô
m
"
n
E

m
"$…
1 n
4
m
"
n
)Ay
m
"
n
Ô
m
"
n
E
r
m
"„…
1 n

r
m
"
n
*
y
m
"
n

Ô
m
"
n
E
Ü
m
"„…
1 n

r

m
"$…
1 n '
8 1
r
m
"$…
1 n
r

m
"
n
'
8 1
r
m
"
n
E
Ô
m
"$…
1 n

'
8 1
r
m
"$…
1 n )
Ü
m
"$…
1 n Ô
m
"
n

The matrix ž does not appear in these equations; only '
8 1 is needed. By the same means, it is possible
to derive a Preconditioned Steepest Descent Method that does not use ž .
The effectiveness of a preconditioner ' is determined by the condition number of '
8 1 
, and occa-
sionally by its clustering of eigenvalues. The problem remains of finding a preconditioner that approximates

well enough to improve convergence enough to make up for the cost of computing the product '
8 1
r
m
"
n
once per iteration. (It is not necessary to explicitly compute ' or ' 8 1; it is only necessary to be able to
compute the effect of applying '
8 1 to a vector.) Within this constraint, there is a surprisingly rich supply
of possibilities, and I can only scratch the surface here.
Page 48
hidden
42 Jonathan Richard Shewchuk
stability is improved if the value Ô
fl  fl 
Ô (in Equation 46) is computed by taking the inner product of  Ô
with itself.
14. The Nonlinear Conjugate Gradient Method
CG can be used not only to find the minimum point of a quadratic form, but to minimize any continuous
function
;
.
M1
for which the gradient
;
]
can be computed. Applications include a variety of optimization
problems, such as engineering design, neural net training, and nonlinear regression.
14.1. Outline of the Nonlinear Conjugate Gradient Method
To derive nonlinear CG, there are three changes to the linear algorithm: the recursive formula for the residual
cannot be used, it becomes more complicated to compute the step size
y
, and there are several different
choices for
Ü
.
In nonlinear CG, the residual is always set to the negation of the gradient; r
m
"
n
+*
;
]
.

m
"
n
1
. The search
directions are computed by Gram-Schmidt conjugation of the residuals as with linear CG. Performing a line
search along this search direction is much more difficult than in the linear case, and a variety of procedures
can be used. As with the linear CG, a value of
y
m
"
n
that minimizes
;
.

m
"
n
)¡y
m
"
n
Ô
m
"
n
1
is found by ensuring
that the gradient is orthogonal to the search direction. We can use any algorithm that finds the zeros of the
expression
J
;
]
.
 m
"
n
)Ay
m
"
n
Ô
m
"
n
1
L

Ô
m
"
n
.
In linear CG, there are several equivalent expressions for the value of
Ü
. In nonlinear CG, these different
expressions are no longer equivalent; researchers are still investigating the best choice. Two choices are the
Fletcher-Reeves formula, which we used in linear CG for its ease of computation, and the Polak-Ribie`re
formula:
Ü3214
m
"$…
1 n

r

m
"„…
1 n r
m
"$…
1 n
r

m
"
n
r
m
"
n
E
Ü

4
m
"$…
1 n

r

m
"$…
1 n .Âr
m
"$…
1 n
*
r
m
"
n
1
r

m
"
n
r
m
"
n

The Fletcher-Reeves method converges if the starting point is sufficiently close to the desired minimum,
whereas the Polak-Ribie`re method can, in rare cases, cycle infinitely without converging. However, Polak-
Ribie`re often converges much more quickly.
Fortunately, convergence of the Polak-Ribie`re method can be guaranteed by choosing
Ü

max •
Ü

4
E
0 – .
Using this value is equivalent to restarting CG if
Ü

4
‹ 0. To restart CG is to forget the past search
directions, and start CG anew in the direction of steepest descent.
Here is an outline of the nonlinear CG method:
Ô
m
0 n

r
m
0 n
+*
;
]
.

m
0 n
1
E
Find
y
m
"
n
that minimizes
;
.
(m
"
n
)ty
m
"
n
Ô
m
"
n
1
E

m
"$…
1 n
4
m
"
n
)Ay
m
"
n
Ô
m
"
n
E
r
m
"$…
1 n
+*
;
]
.

m
"$…
1 n
1
E
Ü
m
"$…
1 n

r

m
"$…
1 n r
m
"„…
1 n
r

m
"
n
r
m
"
n
or
Ü
m
"$…
1 n

max
³
r

m
"$…
1 n .Âr
m
"$…
1 n
*
r
m
"
n
1
r

m
"
n
r
m
"
n
E
0 5
E
Page 49
hidden
The Nonlinear Conjugate Gradient Method 43
Ô
m
"$…
1 n

r
m
"$…
1 n )
Ü
m
"$…
1 n Ô
m
"
n

Nonlinear CG comes with few of the convergence guarantees of linear CG. The less similar
;
is to a
quadratic function, the more quickly the search directions lose conjugacy. (It will become clear shortly that
the term “conjugacy” still has some meaning in nonlinear CG.) Another problem is that a general function
;
may have many local minima. CG is not guaranteed to converge to the global minimum, and may not
even find a local minimum if
;
has no lower bound.
Figure 37 illustrates nonlinear CG. Figure 37(a) is a function with many local minima. Figure 37(b)
demonstrates the convergence of nonlinear CG with the Fletcher-Reeves formula. In this example, CG is
not nearly as effective as in the linear case; this function is deceptively difficult to minimize. Figure 37(c)
shows a cross-section of the surface, corresponding to the first line search in Figure 37(b). Notice that there
are several minima; the line search finds a value of
y
corresponding to a nearby minimum. Figure 37(d)
shows the superior convergence of Polak-Ribie`re CG.
Because CG can only generate conjugate vectors in an -dimensional space, it makes sense to restart
CG every iterations, especially if is small. Figure 38 shows the effect when nonlinear CG is restarted
every second iteration. (For this particular example, both the Fletcher-Reeves method and the Polak-Ribie`re
method appear the same.)
14.2. General Line Search
Depending on the value of
;
]
, it might be possible to use a fast algorithm to find the zeros of
;
]

Ô . For
instance, if
;
] is polynomial in
y
, then an efficient algorithm for polynomial zero-finding can be used.
However, we will only consider general-purpose algorithms.
Two iterative methods for zero-finding are the Newton-Raphson method and the Secant method. Both
methods require that
;
be twice continuously differentiable. Newton-Raphson also requires that it be
possible to compute the second derivative of
;
.

)Ay
Ô
1
with respect to
y
.
The Newton-Raphson method relies on the Taylor series approximation
;
.

)Ay
Ô
176
;
.
M1
)Ay98
Ô
Ô
y
;
.

)ty
Ô
1;:
{
#
0
)
y
2
2
B
Ô
2
Ô
y
2
;
.

)Ay
Ô
1
D
{
#
0
(56)

;
.
M1
)Ay
J
;
]
.
M1
L

Ô
)
y
2
2
Ô

;
] ]
.
<1
Ô
Ô
Ô
y
;
.

)Ay
Ô
176 J
;
]
.
<1
L

Ô
)Ay
Ô

;
] ]
.
<1
Ô
(57)
where
;
] ]
.
<1
is the Hessian matrix
;^] ]
.
M1c






_ 2 <
_`
1
_`
1
_ 2 <
_`
1
_`
2

_ 2 <
_`
1
_`
b
_ 2
<
_`
2
_`
1
_ 2
<
_`
2
_`
2
_ 2
<
_`
2
_`
b
.
.
.
.
.
.
.
.
.
_ 2 <
_`ab¯_`
1
_ 2 <
_`ab¯_a`
2

_ 2 <
_`ab_`ab
ff






Page 50
hidden
44 Jonathan Richard Shewchuk
-4
-2
0
2
4
6
-2
0
2
4
6
-250
0
250
500
(a)

1

2
;
.
<1

-4 -2 2 4 6
-2
2
4
6

1
(b)

2
 m
0 n
-0.04 -0.02 0.02 0.04
-200
200
400
600
y
(c);
.

m
"
n
)Ay
Ô
m
"
n
1
-4 -2 2 4 6
-2
2
4
6

1
(d) 2
(m
0 n
Figure 37: Convergence of the nonlinear Conjugate Gradient Method. (a) A complicated function with many
local minima and maxima. (b) Convergence path of Fletcher-Reeves CG. Unlike linear CG, convergence
does not occur in two steps. (c) Cross-section of the surface corresponding to the first line search. (d)
Convergence path of Polak-Ribie`re CG.
Page 52
hidden
46 Jonathan Richard Shewchuk
The function
;
.

)Ay
Ô
1
is approximately minimized by setting Expression 57 to zero, giving
y
+*
;
]

Ô
Ô

;
] ]
Ô

The truncated Taylor series approximates
;
.

)Yy
Ô
1
with a parabola; we step to the bottom of the parabola
(see Figure 39). In fact, if ; is a quadratic form, then this parabolic approximation is exact, because ; ] ] is
just the familiar matrix  . In general, the search directions are conjugate if they are ; ] ] -orthogonal. The
meaning of “conjugate” keeps changing, though, because ; ] ] varies with  . The more quickly ; ] ] varies
with

, the more quickly the search directions lose conjugacy. On the other hand, the closer (m "
n
is to the
solution, the less
;
] ]
varies from iteration to iteration. The closer the starting point is to the solution, the
more similar the convergence of nonlinear CG is to that of linear CG.
To perform an exact line search of a non-quadratic function, repeated steps must be taken along the line
until
;
]

Ô is zero; hence, one CG iteration may include many Newton-Raphson iterations. The values of
;
]

Ô and Ô
flÓ;
] ]
Ô must be evaluated at each step. These evaluations may be inexpensive if Ô
flŁ;
] ]
Ô can be
analytically simplified, but if the full matrix
;
] ]
must be evaluated repeatedly, the algorithm is prohibitively
slow. For some applications, it is possible to circumvent this problem by performing an approximate line
search that uses only the diagonal elements of
;
] ]
. Of course, there are functions for which it is not possible
to evaluate
;
] ]
at all.
To perform an exact line search without computing
;
] ]
, the Secant method approximates the second
derivative of
;
.

)Yy
Ô
1
by evaluating the first derivative at the distinct points
y

0 and
y
?>
, where
>
is
an arbitrary small nonzero number:
Ô
2
Ô
y
2
;
.

)¡y
Ô
1@6
J
z
z|{
;
.

)Ay
Ô
1
L
{
#BA
*ÝJ
z
z|{
;
.

)Ay
Ô
1
L
{
#
0
>
>
l

0

J
;
]
.

)
>
Ô
1
L

Ô
*ÝJ
;
]
.
M1
L

Ô
>
E
(58)
which becomes a better approximation to the second derivative as
y
and
>
approach zero. If we substitute
Expression 58 for the third term of the Taylor expansion (Equation 56), we have
Ô
Ô
y
;
.

)Ay
Ô
1*6qJ
;
]
.
<1
L

Ô
)
y
>DC
J
;
]
.

)
>
Ô
1
L

Ô
*%J
;
]
.
M1
L

ÔFE

Minimize
;
.

)Ay
Ô
1
by setting its derivative to zero:
y
u*G>
J
;
]
.
M1
L

Ô
J
;
]
.

)
>
Ô
1
L

Ô
*ÝJ
;
]
.
<1
L

Ô
(59)
Like Newton-Raphson, the Secant method also approximates
;
.

)%y
Ô
1
with a parabola, but instead
of choosing the parabola by finding the first and second derivative at a point, it finds the first derivative at
two different points (see Figure 40). Typically, we will choose an arbitrary > on the first Secant method
iteration; on subsequent iterations we will choose

)
>
Ô to be the value of

from the previous Secant
method iteration. In other words, if we let
yIH
"J denote the value of
y
calculated during Secant iteration Œ ,
then
>
H
"$…
1
J
+*
y
H
"KJ
.
Both the Newton-Raphson and Secant methods should be terminated when

is reasonably close to the
exact solution. Demanding too little precision could cause a failure of convergence, but demanding too
fine precision makes the computation unnecessarily slow and gains nothing, because conjugacy will break
Page 54
hidden
48 Jonathan Richard Shewchuk
-4 -2 2 4 6
-2
2
4
6

1

2
 m
0 n
Figure 41: The preconditioned nonlinear Conjugate Gradient Method, using the Polak-Ribie`re formula and
a diagonal preconditioner. The space has been “stretched” to show the improvement in circularity of the
contour lines around the minimum.
for use as a preconditioner. However, be forewarned that if

is sufficiently far from a local minimum, the
diagonal elements of the Hessian may not all be positive. A preconditioner should be positive-definite, so
nonpositive diagonal elements cannot be allowed. A conservative solution is to not precondition (set '  ” )
when the Hessian cannot be guaranteed to be positive-definite. Figure 41 demonstrates the convergence of
diagonally preconditioned nonlinear CG, with the Polak-Ribie`re formula, on the same function illustrated
in Figure 37. Here, I have cheated by using the diagonal of
;
] ]
at the solution point

to precondition every
iteration.
A Notes
Conjugate Direction methods were probably first presented by Schmidt [14] in 1908, and were independently
reinvented by Fox, Huskey, and Wilkinson [7] in 1948. In the early fifties, the method of Conjugate Gradients
was discovered independently by Hestenes [10] and Stiefel [15]; shortly thereafter, they jointly published
what is considered the seminal reference on CG [11]. Convergence bounds for CG in terms of Chebyshev
polynomials were developed by Kaniel [12]. A more thorough analysis of CG convergence is provided by
van der Sluis and van der Vorst [16]. CG was popularized as an iterative method for large, sparse matrices
by Reid [13] in 1971.
CG was generalized to nonlinear problems in 1964 by Fletcher and Reeves [6], based on work by
Davidon [4] and Fletcher and Powell [5]. Convergence of nonlinear CG with inexact line searches was
analyzed by Daniel [3]. The choice of
Ü
for nonlinear CG is discussed by Gilbert and Nocedal [8].
A history and extensive annotated bibliography of CG to the mid-seventies is provided by Golub and
O’Leary [9]. Most research since that time has focused on nonsymmetric systems. A survey of iterative
methods for the solution of linear systems is offered by Barrett et al. [1].
Page 55
hidden
Canned Algorithms 49
B Canned Algorithms
The code given in this section represents efficient implementations of the algorithms discussed in this article.
B1. Steepest Descent
Given the inputs

,

, a starting value

, a maximum number of iterations Œ§ÏoÐ ` , and an error tolerance
5‹ 1:
ŒNM 0
rOM
U*?
Ù
M r

r
Ù
0 M
ÙWhile ŒU‹¡Œ ÏsÐ ` and
Ù
,

2
Ù
0 do
P
M

r
y
M Q
ü
ýR

M

)Ay
r
If Œ is divisible by 50
rSM
s*k

else
rSMCr
*
y
P
Ù
MCr

r
ŒTM Œ
)
1
This algorithm terminates when the maximum number of iterations Œ§ÏoÐ ` has been exceeded, or when
¹
r
m
"
n
¹
Î

¹
r
m
0 n
¹
.
The fast recursive formula for the residual is usually used, but once every 50 iterations, the exact residual
is recalculated to remove accumulated floating point error. Of course, the number 50 is arbitrary; for large
,
©
might be appropriate. If the tolerance is large, the residual need not be corrected at all (in practice,
this correction is rarely used). If the tolerance is close to the limits of the floating point precision of the
machine, a test should be added after
Ù
is evaluated to check if
Ù
Î

2
Ù
0, and if this test holds true, the exact
residual should also be recomputed and
Ù
reevaluated. This prevents the procedure from terminating early
due to floating point roundoff error.
Page 56
hidden
50 Jonathan Richard Shewchuk
B2. Conjugate Gradients
Given the inputs

,

, a starting value

, a maximum number of iterations Œ§ÏoÐ ` , and an error tolerance
5‹ 1:
ŒNM 0
rUM
o*Y
ÔVMCr
Ù

«XWYMKr

r
Ù
0 M
Ù

«XW
While Œc‹¨Œ2ÏoÐ ` and
Ù

«;W
,

2
Ù
0 do
P
M

Ô
y
MZQ
b\[(]
z
ý
R

M

)Ay
Ô
If Œ is divisible by 50
rOM
U*?
else
rOM r
*
y
P
Ù_^a`
z
M
Ù

«XW
Ù

«;WbMCr

r
Ü
McQ
bd[(]
Qfe;g h
ÔiM r
)
Ü
Ô
ŒNMKŒ
)
1
See the comments at the end of Section B1.
Page 57
hidden
Canned Algorithms 51
B3. Preconditioned Conjugate Gradients
Given the inputs

,

, a starting value

, a (perhaps implicitly defined) preconditioner ' , a maximum
number of iterations Œ§ÏoÐ ` , and an error tolerance ‹ 1:
ŒNM 0
rUM
o*Y
ÔVMj'
8 1
r
Ù

«XWYMKr

Ô
Ù
0 M
Ù

«XW
While Œc‹¨Œ2ÏoÐ ` and
Ù

«;W
,

2
Ù
0 do
P
M

Ô
y
MZQ
b\[(]
z
ý
R

M

)Ay
Ô
If Œ is divisible by 50
rOM
U*?
else
rOM r
*
y
P
k
Mj'
8 1
r
Ù_^a`
z
M
Ù

«XW
Ù

«;WbMCr

k
Ü
M
Q
bd[(]
Q
e;g h
ÔiM
k
)
Ü
Ô
ŒNMKŒ
)
1
The statement “ k Ml'
8 1
r ” implies that one should apply the preconditioner, which may not actually
be in the form of a matrix.
See also the comments at the end of Section B1.
Page 58
hidden
52 Jonathan Richard Shewchuk
B4. Nonlinear Conjugate Gradients with Newton-Raphson and Fletcher-Reeves
Given a function
;
, a starting value

, a maximum number of CG iterations Œ2ÏoÐ ` , a CG error tolerance
‹ 1, a maximum number of Newton-Raphson iterations
µ
ÏoÐ
`
, and a Newton-Raphson error tolerance#
‹ 1:
ŒNM 0

M 0
rUM
* ;
]
.
M1
ÔVMCr
Ù

«XWYMKr

r
Ù
0 M
Ù

«XW
While Œc‹¨Œ2ÏoÐ ` and
Ù

«;W
,

2
Ù
0 do
µ
M 0
Ù
z
MKÔ

Ô
Do
y
M
*
H
<_m
m
`
n
J
ý
z
z
ý
<
m m
m
`
n
z

M

)ty
Ô
µ
M
µ
)
1
while
µ
‹
µ
ÏsÐ
` and
y
2
Ù
z
,
#
2
rSM
*
;
]
.
<1
Ù_^a`
z
M
Ù

«XW
Ù

«;WbMCr

r
Ü
McQ
bd[(]
Q
e;g h
ÔiM r
)
Ü
Ô

M

)
1
If
¶d
or r

Ô
Î 0
ÔiMCr

M 0
ŒNMKŒ
)
1
This algorithm terminates when the maximum number of iterations Œ§ÏoÐ ` has been exceeded, or when
¹
r
m
"
n
¹
Î

¹
r
m
0 n
¹
.
Each Newton-Raphson iteration adds
y
Ô to

; the iterations are terminated when each update
y
Ô falls
below a given tolerance ( ¹
y
Ô
¹
Î
#
), or when the number of iterations exceeds
µ
ÏsÐ
`
. A fast inexact line
search can be accomplished by using a small
µ
ÏoÐ
` and/or by approximating the Hessian
;
] ]
.
<1
with its
diagonal.
Nonlinear CG is restarted (by setting ÔDM r ) whenever a search direction is computed that is not a
descent direction. It is also restarted once every iterations, to improve convergence for small .
The calculation of
y
may result in a divide-by-zero error. This may occur because the starting point
 m
0 n
is not sufficiently close to the desired minimum, or because
;
is not twice continuously differentiable. In
the former case, the solution is to choose a better starting point or a more sophisticated line search. In the
latter case, CG might not be the most appropriate minimization algorithm.
Page 60
hidden
54 Jonathan Richard Shewchuk
to achieve convergence.
The Polak-Ribie`re
Ü
parameter is Q
b\[]
8
Q

ø
h
Q e;g h

ü ý
÷úøK{
1
ù}|
÷úøK{
1
ù
8
ü}ý
÷úøK{
1
ù}|
÷úøjù
ü ý
÷úøjù |
÷úøjù

ü}ý
÷úø{
1
ù~
û
1
m
ü
÷‰ø{
1
ù
8
ü
÷úøjù
n
ü ý
÷úøjù ~
û
1
ü
÷úøjù
. Care must
be taken that the preconditioner ' is always positive-definite. The preconditioner is not necessarily in the
form of a matrix.
Nonlinear CG is restarted (by setting ԀM r ) whenever the Polak-Ribie`re
Ü
parameter is negative. It is
also restarted once every iterations, to improve convergence for small .
Nonlinear CG presents several choices: Preconditioned or not, Newton-Raphson method or Secant
method or another method, Fletcher-Reeves or Polak-Ribie`re. It should be possible to construct any of the
variations from the versions presented above. (Polak-Ribie`re is almost always to be preferred.)
C Ugly Proofs
C1. The Solution to 

Minimizes the Quadratic Form
Suppose

is symmetric. Let

be a point that satisfies
 
and minimizes the quadratic form
(Equation 3), and let p be an error term. Then
;
.

)
p
1Õ 1
2 .

)
p
1


.

)
p
1Ł*?

.

)
p
1
)A@
(by Equation 3)
 1
2




)
p



)
1
2
p


p
*?

d*t

p
)A@
(by symmetry of  )
 1
2



d*?


)t@U)
p


)
1
2
p


p
*?

p

;
.
<1
)
1
2
p


p

If

is positive-definite, then the latter term is positive for all pl

0; therefore

minimizes
;
.
C2. A Symmetric Matrix Has Orthogonal Eigenvectors.
Any matrix has at least one eigenvector. To see this, note that det .
u*
‡'”
1
is a polynomial in ‡ , and
therefore has at least one zero (possibly complex); call it ‡ » . The matrix 
* ‡ » ” has determinant zero
and is therefore singular, so there must be a nonzero vector † for which .
4*
‡
»
”
1
†

0. The vector † is
an eigenvector, because

†

‡
»
† .
Any symmetric matrix has a full complement of orthogonal eigenvectors. To prove this, I will
demonstrate for the case of a 4  4 matrix

and let you make the obvious generalization to matrices of
arbitrary size. It has already been proven that

has at least one eigenvector † with eigenvalue ‡
»
. Let

1

†œª
¹
†
¹
, which has unit length. Choose three arbitrary vectors

2
E

3
E

4 so that the

"
are mutually
orthogonal and all of unit length (such vectors can be found by Gram-Schmidt orthogonalization). Let

´J 
1

2

3

4 L . Because the

"
are orthonormal, 



” , and so 
fl?

8 1
. Note also that for
Page 61
hidden
Ugly Proofs 55
ŒG‚
ƒ 1, „†…1 ‡ „‰ˆ ƒ „†…ˆ ‡ „ 1 ƒ „‰…ˆ3ن‹ „ 1 ƒ 0. Thus,
Œ
…
‡
Œ
ƒ Ž
Ž
Ž
„†…1
„ …2
„ …3
„†…4
’‘
‘
‘
“
‡•”
„ 1 „ 2 „ 3 „ 4 –
ƒ—Ž
Ž
Ž
„†…1
„ …2
„ …3
„†…4
’‘
‘
‘
“
”
ن‹
„ 1
‡
„ 2
‡
„ 3
‡
„ 4 –
ƒ Ž
Ž
Ž
ى‹
0 0 0
0
0 ˜
0
’‘
‘
‘
“š™
where ˜ is a 3 › 3 symmetric matrix. ˜ must have some eigenvalue œ with eigenvalue
Łffi
. Let žœ be a
vector of length 4 whose first element is zero, and whose remaining elements are the elements of œ . Clearly,
Œ Ÿ 1
‡
Œ
ž
œ
ƒ
Œ
…
‡
Œ
ž
œ
ƒ—Ž
Ž
Ž

Ł ‹
0 0 0
0
0 ˜
0
’‘
‘
‘
“
ž
œ
ƒ
Łffi
ž
œ¢¡
In other words,
‡
Œ
žœ
ƒ
ى
Œ
žœ , and thus Œ žœ is an eigenvector of
‡
. Furthermore, „ …1
Œ
žœ
ƒ
£
1 0 0 0 ¤¥žœ ƒ 0, so „ 1 and
Œ
žœ are orthogonal. Thus,
‡
has at least two orthogonal eigenvectors!
The more general statement that a symmetric matrix has ¦ orthogonal eigenvectors is proven by induction.
For the example above, assume any 3 › 3 matrix (such as ˜ ) has 3 orthogonal eigenvectors; call them žœ 1,
žœ 2, and žœ 3. Then
Œ
žœ 1,
Œ
žœ 2, and
Œ
žœ 3 are eigenvectors of
‡
, and they are orthogonal because Œ
has orthonormal columns, and therefore maps orthogonal vectors to orthogonal vectors. Thus,
‡
has 4
orthogonal eigenvectors.
C3. Optimality of Chebyshev Polynomials
Chebyshev polynomials are optimal for minimization of expressions like Expression 50 because they
increase in magnitude more quickly outside the range
£§
1
™
1 ¤ than any other polynomial that is restricted to
have magnitude no greater than one inside the range
£§
1
™
1 ¤ .
The Chebyshev polynomial of degree ¨ ,
©
ˆª}«­¬
ƒ
1
2 ” ª}«o®¯ «
2
® 1 ¬ ˆ ®°ª}«
§
¯ «
2 § 1 ¬ ˆ
–
™
can also be expressed on the region
£§
1
™
1 ¤ as
©
ˆaª}«­¬
ƒ cos ª¨ cos
Ÿ 1
«­¬
™
§
1 ±²«³± 1 ¡
From this expression (and from Figure 32), one can deduce that the Chebyshev polynomials have the
property ´
©
ˆª}«­¬
´
± 1
™
§
1 ±³«9± 1
and oscillate rapidly between
§
1 and 1:
©
ˆTµ cos µ3¶¸·
¨º¹*¹
ƒ
ª
§
1 ¬w»
™

ƒ 0
™
1
™
¡¡¡
™
¨¼¡
Page 62
hidden
56 Jonathan Richard Shewchuk
Notice that the ¨ zeros of © ˆ must fall between the ¨†® 1 extrema of © ˆ in the range
£§
1
™
1 ¤ . For an example,
see the five zeros of © 5 ª}«­¬ in Figure 32.
Similarly, the function ½
ˆ¼ª
Ł
¬
ƒ
©
ˆ1¾À¿rÁFÂÃÄB¿
ÁBŒÆ
Ÿ 2
¿
¿ ÁFÂÃ
Ÿ
¿
ÁFÅKƗÇ
©
ˆ ¾¿ ÁFÂà ÄB¿
ÁFÅKÆ
¿ ÁFÂÃ
Ÿ
¿
ÁFÅKÆ Ç
oscillates in the range È © ˆª_¿rÁFÂÃ_ÄB¿
ÁFÅKÆ
¿rÁFÂÃ
Ÿ
¿
ÁFÅKÆ
¬
Ÿ 1 on the domain
£
ŁÊÉ
ˆÌË
™
نÉIͼÎ
¤ .
½
ˆª
Ł
¬ also satisfies the requirement
that
½
ˆaª 0 ¬ ƒ 1.
The proof relies on a cute trick. There is no polynomial ÏЈ¼ª
Ł
¬ of degree ¨ such that ÏЈª 0 ¬ ƒ 1 and ÏЈ
is better than
½
ˆ on the range
£
Ł É
ˆÌË
™
Ł É­Í¼Î
¤ . To prove this, suppose that there is such a polynomial; then,
ÏЈª
Ł
¬ÐÑ
©
ˆª ¿rÁFÂaÃrÄB¿
ÁFÅKÆ
¿rÁFÂÃ
Ÿ
¿
ÁFÅKÆ
¬
Ÿ 1 on the range
£
ŁÊÉ
ˆÌË
™
نÉIÍÒÎ
¤ . It follows that the polynomial
½
ˆ
§
ÏЈ has a zero
at
Ł
ƒ 0, and also has ¨ zeros on the range
£
ŁÊÉ
ˆË
™
نÉIͼÎ
¤ . Hence,
½
ˆ
§
ÏUˆ is a polynomial of degree ¨ with
at least ¨1® 1 zeros, which is impossible. By contradiction, I conclude that the Chebyshev polynomial of
degree ¨ optimizes Expression 50.
D Homework
For the following questions, assume (unless otherwise stated) that you are using exact arithmetic; there is
no floating point roundoff error.
1. (Easy.) Prove that the preconditioned Conjugate Gradient method is not affected if the preconditioning
matrix is scaled. In other words, for any nonzero constant Ó , show that the sequence of steps
„ÕÔ 0 Ö
™
„ÕÔ 1 Ö
™
¡¡¡ taken using the preconditioner ÓB× is identical to the steps taken using the preconditioner
× .
2. (Hard, but interesting.) Suppose you wish to solve
‡
„
ƒlØ for a symmetric, positive-definite
¦Ù›Ú¦ matrix
‡
. Unfortunately, the trauma of your linear algebra course has caused you to repress
all memories of the Conjugate Gradient algorithm. Seeing you in distress, the Good Eigenfairy
materializes and grants you a list of the Û distinct eigenvalues (but not the eigenvectors) of
‡
.
However, you do not know how many times each eigenvalue is repeated.
Clever person that you are, you mumbled the following algorithm in your sleep this morning:
Choose an arbitrary starting point „ Ô 0 Ö
For ¨TÜ 0 to Û
§
1
Ý
Ô
ˆ
Ö
Ü
Ø
§
‡
„ÕÔ
ˆ
Ö
Remove an arbitrary eigenvalue from the list and call it
Ł
ˆ
„ÕÔ
ˆ
Ä
1 Ö Üj„ÕÔ ˆ Ö ® Ł
Ÿ 1
ˆ
Ý
Ô
ˆ
Ö
No eigenvalue is used twice; on termination, the list is empty.
(a) Show that upon termination of this algorithm, „ ÔÞ
Ö
is the solution to
‡
„
ƒ°Ø
. Hint: Read Section
6.1 carefully. Graph the convergence of a two-dimensional example; draw the eigenvector axes,
and try selecting each eigenvalue first. (It is most important to intuitively explain what the
algorithm is doing; if you can, also give equations that prove it.)
Page 63
hidden
Homework 57
(b) Although this routine finds the exact solution after Û iterations, you would like each intermediate
iterate „ Ô
ˆ
Ö
to be as close to the solution as possible. Give a crude rule of thumb for how you
should choose an eigenvalue from the list on each iteration. (In other words, in what order
should the eigenvalues be used?)
(c) What could go terribly wrong with this algorithm if floating point roundoff occurs?
(d) Give a rule of thumb for how you should choose an eigenvalue from the list on each iteration to
prevent floating point roundoff error from escalating. Hint: The answer is not the same as that
of question (b).
Page 64
hidden
58 Jonathan Richard Shewchuk
References
[1] Richard Barrett, Michael Berry, Tony Chan, James Demmel, June Donato, Jack Dongarra, Victor
Eijkhout, Roldan Pozo, Charles Romine, and Henk van der Vorst, Templates for the Solution of Linear
Systems: Building Blocks for Iterative Methods, SIAM, Philadelphia, Pennsylvania, 1993.
[2] William L. Briggs, A Multigrid Tutorial, SIAM, Philadelphia, Pennsylvania, 1987.
[3] James W. Daniel, Convergence of the Conjugate Gradient Method with Computationally Convenient
Modifications, Numerische Mathematik 10 (1967), 125–131.
[4] W. C. Davidon, Variable Metric Method for Minimization, Tech. Report ANL-5990, Argonne National
Laboratory, Argonne, Illinois, 1959.
[5] R. Fletcher and M. J. D. Powell, A Rapidly Convergent Descent Method for Minimization, Computer
Journal 6 (1963), 163–168.
[6] R. Fletcher and C. M. Reeves, Function Minimization by Conjugate Gradients, Computer Journal 7
(1964), 149–154.
[7] L. Fox, H. D. Huskey, and J. H. Wilkinson, Notes on the Solution of Algebraic Linear Simultaneous
Equations, Quarterly Journal of Mechanics and Applied Mathematics 1 (1948), 149–173.
[8] Jean Charles Gilbert and Jorge Nocedal, Global Convergence Properties of Conjugate Gradient
Methods for Optimization, SIAM Journal on Optimization 2 (1992), no. 1, 21–42.
[9] Gene H. Golub and Dianne P. O’Leary, Some History of the Conjugate Gradient and Lanczos Algo-
rithms: 1948–1976, SIAM Review 31 (1989), no. 1, 50–102.
[10] Magnus R. Hestenes, Iterative Methods for Solving Linear Equations, Journal of Optimization Theory
and Applications 11 (1973), no. 4, 323–334, Originally published in 1951 as NAML Report No. 52-9,
National Bureau of Standards, Washington, D.C.
[11] Magnus R. Hestenes and Eduard Stiefel, Methods of Conjugate Gradients for Solving Linear Systems,
Journal of Research of the National Bureau of Standards 49 (1952), 409–436.
[12] Shmuel Kaniel, Estimates for Some Computational Techniques in Linear Algebra, Mathematics of
Computation 20 (1966), 369–378.
[13] John K. Reid, On the Method of Conjugate Gradients for the Solution of Large Sparse Systems of
Linear Equations, Large Sparse Sets of Linear Equations (London and New York) (John K. Reid, ed.),
Academic Press, London and New York, 1971, pp. 231–254.
[14] E. Schmidt, Title unknown, Rendiconti del Circolo Matematico di Palermo 25 (1908), 53–77.
[15] Eduard Stiefel, ¨Uber einige Methoden der Relaxationsrechnung, Zeitschrift fu¨r Angewandte Mathe-
matik und Physik 3 (1952), no. 1, 1–33.
[16] A. van der Sluis and H. A. van der Vorst, The Rate of Convergence of Conjugate Gradients, Numerische
Mathematik 48 (1986), no. 5, 543–560.

Sign up today - FREE

Mendeley saves you time finding and organizing research. Learn more

  • All your research in one place
  • Add and import papers easily
  • Access it anywhere, anytime

Start using Mendeley in seconds!

Already have an account? Sign in

Readership Statistics

451 Readers on Mendeley
by Discipline
 
 
 
by Academic Status
 
43% Ph.D. Student
 
12% Post Doc
 
8% Student (Master)
by Country
 
32% United States
 
9% United Kingdom
 
8% Germany