- Source: Nonlinear conjugate gradient method
In numerical optimization, the nonlinear conjugate gradient method generalizes the conjugate gradient method to nonlinear optimization. For a quadratic function
f
(
x
)
{\displaystyle \displaystyle f(x)}
f
(
x
)
=
‖
A
x
−
b
‖
2
,
{\displaystyle \displaystyle f(x)=\|Ax-b\|^{2},}
the minimum of
f
{\displaystyle f}
is obtained when the gradient is 0:
∇
x
f
=
2
A
T
(
A
x
−
b
)
=
0
{\displaystyle \nabla _{x}f=2A^{T}(Ax-b)=0}
.
Whereas linear conjugate gradient seeks a solution to the linear equation
A
T
A
x
=
A
T
b
{\displaystyle \displaystyle A^{T}Ax=A^{T}b}
, the nonlinear conjugate gradient method is generally
used to find the local minimum of a nonlinear function
using its gradient
∇
x
f
{\displaystyle \nabla _{x}f}
alone. It works when the function is approximately quadratic near the minimum, which is the case when the function is twice differentiable at the minimum and the second derivative is non-singular there.
Given a function
f
(
x
)
{\displaystyle \displaystyle f(x)}
of
N
{\displaystyle N}
variables to minimize, its gradient
∇
x
f
{\displaystyle \nabla _{x}f}
indicates the direction of maximum increase.
One simply starts in the opposite (steepest descent) direction:
Δ
x
0
=
−
∇
x
f
(
x
0
)
{\displaystyle \Delta x_{0}=-\nabla _{x}f(x_{0})}
with an adjustable step length
α
{\displaystyle \displaystyle \alpha }
and performs a line search in this direction until it reaches the minimum of
f
{\displaystyle \displaystyle f}
:
α
0
:=
arg
min
α
f
(
x
0
+
α
Δ
x
0
)
{\displaystyle \displaystyle \alpha _{0}:=\arg \min _{\alpha }f(x_{0}+\alpha \Delta x_{0})}
,
x
1
=
x
0
+
α
0
Δ
x
0
{\displaystyle \displaystyle x_{1}=x_{0}+\alpha _{0}\Delta x_{0}}
After this first iteration in the steepest direction
Δ
x
0
{\displaystyle \displaystyle \Delta x_{0}}
, the following steps constitute one iteration of moving along a subsequent conjugate direction
s
n
{\displaystyle \displaystyle s_{n}}
, where
s
0
=
Δ
x
0
{\displaystyle \displaystyle s_{0}=\Delta x_{0}}
:
Calculate the steepest direction:
Δ
x
n
=
−
∇
x
f
(
x
n
)
{\displaystyle \Delta x_{n}=-\nabla _{x}f(x_{n})}
,
Compute
β
n
{\displaystyle \displaystyle \beta _{n}}
according to one of the formulas below,
Update the conjugate direction:
s
n
=
Δ
x
n
+
β
n
s
n
−
1
{\displaystyle \displaystyle s_{n}=\Delta x_{n}+\beta _{n}s_{n-1}}
Perform a line search: optimize
α
n
=
arg
min
α
f
(
x
n
+
α
s
n
)
{\displaystyle \displaystyle \alpha _{n}=\arg \min _{\alpha }f(x_{n}+\alpha s_{n})}
,
Update the position:
x
n
+
1
=
x
n
+
α
n
s
n
{\displaystyle \displaystyle x_{n+1}=x_{n}+\alpha _{n}s_{n}}
,
With a pure quadratic function the minimum is reached within N iterations (excepting roundoff error), but a non-quadratic function will make slower progress. Subsequent search directions lose conjugacy requiring the search direction to be reset to the steepest descent direction at least every N iterations, or sooner if progress stops. However, resetting every iteration turns the method into steepest descent. The algorithm stops when it finds the minimum, determined when no progress is made after a direction reset (i.e. in the steepest descent direction), or when some tolerance criterion is reached.
Within a linear approximation, the parameters
α
{\displaystyle \displaystyle \alpha }
and
β
{\displaystyle \displaystyle \beta }
are the same as in the
linear conjugate gradient method but have been obtained with line searches.
The conjugate gradient method can follow narrow (ill-conditioned) valleys, where the steepest descent method slows down and follows a criss-cross pattern.
Four of the best known formulas for
β
n
{\displaystyle \displaystyle \beta _{n}}
are named after their developers:
Fletcher–Reeves:
β
n
F
R
=
Δ
x
n
T
Δ
x
n
Δ
x
n
−
1
T
Δ
x
n
−
1
.
{\displaystyle \beta _{n}^{FR}={\frac {\Delta x_{n}^{T}\Delta x_{n}}{\Delta x_{n-1}^{T}\Delta x_{n-1}}}.}
Polak–Ribière:
β
n
P
R
=
Δ
x
n
T
(
Δ
x
n
−
Δ
x
n
−
1
)
Δ
x
n
−
1
T
Δ
x
n
−
1
.
{\displaystyle \beta _{n}^{PR}={\frac {\Delta x_{n}^{T}(\Delta x_{n}-\Delta x_{n-1})}{\Delta x_{n-1}^{T}\Delta x_{n-1}}}.}
Hestenes–Stiefel:
β
n
H
S
=
Δ
x
n
T
(
Δ
x
n
−
Δ
x
n
−
1
)
−
s
n
−
1
T
(
Δ
x
n
−
Δ
x
n
−
1
)
.
{\displaystyle \beta _{n}^{HS}={\frac {\Delta x_{n}^{T}(\Delta x_{n}-\Delta x_{n-1})}{-s_{n-1}^{T}(\Delta x_{n}-\Delta x_{n-1})}}.}
Dai–Yuan:
β
n
D
Y
=
Δ
x
n
T
Δ
x
n
−
s
n
−
1
T
(
Δ
x
n
−
Δ
x
n
−
1
)
.
{\displaystyle \beta _{n}^{DY}={\frac {\Delta x_{n}^{T}\Delta x_{n}}{-s_{n-1}^{T}(\Delta x_{n}-\Delta x_{n-1})}}.}
.
These formulas are equivalent for a quadratic function, but for nonlinear optimization the preferred formula is a matter of heuristics or taste. A popular choice is
β
=
max
{
0
,
β
P
R
}
{\displaystyle \displaystyle \beta =\max\{0,\beta ^{PR}\}}
, which provides a direction reset automatically.
Algorithms based on Newton's method potentially converge much faster. There, both step direction and length are computed from the gradient as the solution of a linear system of equations, with the coefficient matrix being the exact Hessian matrix (for Newton's method proper) or an estimate thereof (in the quasi-Newton methods, where the observed change in the gradient during the iterations is used to update the Hessian estimate). For high-dimensional problems, the exact computation of the Hessian is usually prohibitively expensive, and even its storage can be problematic, requiring
O
(
N
2
)
{\displaystyle O(N^{2})}
memory (but see the limited-memory L-BFGS quasi-Newton method).
The conjugate gradient method can also be derived using optimal control theory. In this accelerated optimization theory, the conjugate gradient method falls out as a nonlinear optimal feedback controller,
u
=
k
(
x
,
x
˙
)
:=
−
γ
a
∇
x
f
(
x
)
−
γ
b
x
˙
{\displaystyle u=k(x,{\dot {x}}):=-\gamma _{a}\nabla _{x}f(x)-\gamma _{b}{\dot {x}}}
for the double integrator system,
x
¨
=
u
{\displaystyle {\ddot {x}}=u}
The quantities
γ
a
>
0
{\displaystyle \gamma _{a}>0}
and
γ
b
>
0
{\displaystyle \gamma _{b}>0}
are variable feedback gains.
See also
Gradient descent
Broyden–Fletcher–Goldfarb–Shanno algorithm
Conjugate gradient method
L-BFGS (limited memory BFGS)
Nelder–Mead method
Wolfe conditions
References
Kata Kunci Pencarian:
- Nonlinear conjugate gradient method
- Conjugate gradient method
- Gradient method
- Nelder–Mead method
- Slope
- Gradient descent
- Iterative method
- List of numerical analysis topics
- Wolfe conditions
- Newton's method