Newton’s Method#

Newton’s method is originally a root-finding method for nonlinear equations, but in combination with optimality conditions it becomes the workhorse of many optimization algorithms.

Root Finding Using Newton’s Method#

First, we show how we can use Newton’s method to solve the problem of

\[ f(x) = 0. \]

The procedure is actually fairly simple

  1. an initial starting point \(\hat{x} \leftarrow x_0\) would need to be provided

  2. find a linear approximation to \(f(x)\) at \(\hat{x}\) using Taylor’s expansion \(\widetilde{f}(x) = f(\hat{x}) + \nabla f(\hat{x})(x - \hat{x})\)

  3. Solve the problem \(\widetilde{f}(x) = 0\) and set the solution to \(\hat{x}\)

  4. go back to step 2 if \(f(\hat{x}) > \epsilon\).

For solving the problem of \(f(x) = x^2 + 2x - 10 = 0\), the process looks like

newtons method illustration

The iterations for solving \(f(x) = x^2 + 2x - 10 = 0\). [notebook]#

Solving Unconstrained Optimization Problems#

The general form of unconstrained optimization problems is

\[ \min_x \quad f(x). \]

Pure Newton’s Method#

Under the assumption that \(f(x)\) is convex, we have its optimality condition as

\[ \nabla f(x) = 0. \]

Thus, if we use Newton’s method to perform a root finding on the function \(\nabla f(x)\), then we would find our optimal solution. A linear approximation to the gradient would be

\[ \nabla\widetilde{f}(x) = \nabla f(\hat{x}) + \nabla^2 f(\hat{x})(x - \hat{x}). \]

And the solution to \(\nabla\widetilde{f}(x) = 0\) is

\[ x = \hat{x} - \Big[\nabla^2 f(\hat{x})\Big]^{-1}\nabla f(\hat{x}). \]

So, similar to the procedures for root finding, the steps to optimize an unconstrained convex optimization problems are

  1. an initial starting point \(\hat{x} \leftarrow x_0\) would need to be provided

  2. find a linear approximation to \(\nabla f(x)\) at \(\hat{x}\) using Taylor’s expansion \(\nabla\widetilde{f}(x) = \nabla f(\hat{x}) + \nabla^2 f(\hat{x})(x - \hat{x})\)

  3. Solve the problem \(\widetilde{f}(x) = 0\) and set the solution to \(\hat{x} \leftarrow \hat{x} - \Big[\nabla^2 f(\hat{x})\Big]^{-1}\nabla f(\hat{x})\)

  4. go back to step 2 if \(\nabla f(\hat{x}) > \epsilon\).

These are the steps to what is called “pure” Newton’s method. However, what is commonly used right now is “damped” Newton’s method, which is “pure” Newton’s method with an additional backtracking line search.

Damped Newton’s Method#

The damped Newton’s method is simply a combination of pure Newton’s method and backtracking line search. The steps are

  1. an initial starting point \(\hat{x} \leftarrow x_0\) would need to be provided;

  2. compute the direction \(v\);

  3. find \(t\) that satifies \(f(x + tv) \leq f(x) + \alpha t \nabla f^T(x)v\) using backtracking line search;

  4. set the solution to \(\hat{x} \leftarrow \hat{x} + tv\);

  5. go back to step 2 if \(\nabla f(\hat{x}) > \epsilon\).

Damped Newton’s method has distinctly two phase, one damped phase where the reduction for each step is similar and a quadratic convergence phase.

Solving Equality Constrained Optimization Problems#

The general form of equality constrained convex optimization problems is

\[\begin{split} \begin{align} \min_x\ &\ f(x)\\ \mathrm{subject\ to}\ &\ Ax = b. \end{align} \end{split}\]

We can also see Newton’s method as performing a quadratic approximation to the objective and find the minimum to this quadratic approximation. The quadratic approximation in this case is

\[ \widetilde{f}(x) = f(\hat{x}) + \nabla f^T(\hat{x})(x - \hat{x}) + \frac{1}{2}(x - \hat{x})^T\nabla^2f^T(\hat{x})(x - \hat{x}) \]

For equality constrained convex optimization problems, we require Newton’s method to start at a feasible point, thus, we have \(A\hat{x} = b\). We also want the point reached after each step to be feasible. Therefore, we are solving the following problem at each time step

\[\begin{split} \begin{align} \min_x\ &\ f(\hat{x}) + \nabla f^T(\hat{x})(x - \hat{x}) + \frac{1}{2}(x - \hat{x})^T\nabla^2f^T(\hat{x})(x - \hat{x})\\ \mathrm{subject\ to}\ &\ Ax = b. \end{align} \end{split}\]

The solution to this problem \(x^\star\) will give us the direction \(v = x^\star - x\). Note that

\[ Ax^\star = Av + Ax = Av + b = b. \]

Therefore, we require \(Av = 0\). Then, we can rewrite the approximate optimization problem solved at each time step as

\[\begin{split} \begin{align} \min_v\ &\ \nabla f^T(\hat{x})v + \frac{1}{2}v^T\nabla^2f^T(\hat{x})v\\ \mathrm{subject\ to}\ &\ Av = 0. \end{align} \end{split}\]

The Lagrangian for this problem is

\[ L(v, w) = \nabla f^T(\hat{x})v + \frac{1}{2}v^T\nabla^2f^T(\hat{x})v + w^TAv. \]

If we write the KKT conditions for this problem we have the stationarity condition as

\[ \nabla f(\hat{x}) + \nabla^2f^T(\hat{x})v + A^Tw = 0 \]

and we have the primal feasibility condition as

\[ Av = 0. \]

If we write them together, we have

\[\begin{split} \begin{bmatrix} \nabla^2f^T(\hat{x}) & A^T\\ A & \mathbf{0} \end{bmatrix}\begin{bmatrix} v\\ w \end{bmatrix} = \begin{bmatrix} -\nabla f(\hat{x})\\ \mathbf{0} \end{bmatrix} \end{split}\]

Thus, by solving this KKT system we would obtain the update direction \(v\), and then we can do a backtracking line search to find the correct step size, and everything else would be the same as the unconstrained damped Newton’s method.