Proximal Gradient Descent#
The problem of interest has the form of
where \(g(x)\) is convex and differentiable and \(h(x)\) is convex while not necessarily differentiable. Recall when discussing gradient descent we would find a quadratic approximization to the objective function \(f(x)\)
then we take the step such that this quadratic approximization is minimized
The idea here is since only \(g\) is differentiable, why not only make a quadratic approximization to \(g\) and leave \(h\) alone? If that is the case we have
If we define the proximal mapping as
Then, we can write the operations in (1) as
To make it look more familiar, we can use the generalized gradient
which makes the operations in (1) look like
Iterative Soft-thresholding Algorithm (ISTA)#
For the Lasso problem
we can write it in proximal gradient descent form \(\min_\beta g(\beta) + h(\beta)\). We can then get the update as
Let the proximal mapping be
where \(S_{t\lambda}(x)\) is the soft-thresholding operator. Using the fact that \(\nabla g(\beta) = -X^T(y - X\beta)\), we can then write the update using the soft-thresholding operator as
This is called the Iterative Soft-thresholding Algorithm (ISTA). In the general case of
where \(g\) is convex and differentiable, we have the update as \(\beta^+ = S_{t\lambda}(\beta - t\nabla g(\beta))\).
Fast Iterative Soft-thresholding Algorithm (FISTA)#
FISTA is simply ISTA with momentum, the iteratives would have the form of
Set \(\beta^{(0)} = \beta^{(-1)} \in \mathbb{R}^n\);
Momentum step: \(\displaystyle v = \beta^{(k-1)} + \frac{k-2}{k+1}\Big[\beta^{(k-1)} - \beta^{(k-2)}\Big]\);
ISTA update: \(\beta^{(k)} = S_{t_k\lambda}(v - t_k\nabla g(v))\);
Check convergence, if not go back to step 2;