Subgradients#
For a convex and differentiable function \(f\), the gradient of \(f\), denoted as \(\nabla f\), satisfies
For functions that are convex but not necessarily differentiable (e.g. \(f(x) = |x|\)), we can generalize the concept of gradients to subgradients. The subgradient \(g\in\mathbb{R}^n\) of a convex function \(f\) at \(x\) satisfies
Subgradients Properties
The subgradient always exists within the relative interior of the domain. It does not necessarily exist at the boundary, e.g., when there is an indicator functions.
When \(f\) is differentiable at \(x\), then \(g = \nabla f(x)\) uniquely.
Subdifferential#
The set of all subgradients of a convex function \(f\) is called the subdifferential
Subdifferential Properties
\(\partial f(x)\) is always closed and convex.
\(\partial f(x)\) is nonempty for convex \(f\)s.
If \(f\) is differentiable at \(x\), then \(\partial f(x) = \{\nabla f(x)\}\).
If \(\partial f(x) = \{g\}\), then \(f\) is differentiable at \(x\) and \(\nabla f(x) = g\).
Subgradient Calculus#
Scaling: \(\partial(af) = a\partial(f)\) when \(a > 0\).
Addition: \(\partial(f_1 + f_2) = \partial f_1 + \partial f_2\).
Affine composition: If \(h(x) = f(Ax + b)\) then \(\partial h(x) = A^T\partial f(Ax + b)\).
Finite pointwise maximum: if \(f(x) = \max_{i = 1, \cdots, m}f_i(x)\), then the subdifferential is the convex hull of the union of the subdifferentials of all active functions \(f_i\), where the maximum is obtained, i.e., \(f_i(x) = f(x)\),
General pointwise maximum: if \(f(x) = \max_{s \in S}f_s(x)\), then
Subdifferential of an Indicator Function#
For a convex set \(C \subseteq \mathbb{R}^n\), the indicator function on set \(C\) is defined as \(I_C: \mathbb{R}^n \rightarrow \mathbb{R}^n\):
The normal cone \(\mathcal{N}_C(x)\) of the set \(C\) at \(x\) is defined as
Then, we have the relationship
Proof
From (1), we know that the subgradient of \(I_C\) satisfies the relationship
Since \(x, y \in C\), we have \(I_C(x) = I_C(y) = 0\). Thus, the above inequality becomes
Since any \(g\) that satisfies \(g^Tx \geq g^Ty\) will belong to \(\mathcal{N}_C(x)\), we can conclude that \(\partial I_C(x) = \mathcal{N}_C(x)\).
Subgradient Optimality Condition#
For any \(f\) (can be nonconvex), we have
In other words, \(x^\star\) is a minimizer if and only if \(0\) belongs to the subgradient of \(f\) at \(x^\star\). We can easily understand this by setting \(g = 0\) in (1), which leads to
Using this, we can also derive a variant of the first-order optimality condition
Examples#
Lasso Optimality Conditions#
Given \(y \in \mathbb{R}^n\), \(X \in \mathbb{R}^{n \times p}\), and \(\lambda \geq 0\), the Lasso problem can be written as
Using the subgradient optimality condition, we have
Which is equivalent to saying
with \(v \in \partial\|\beta\|_1\). We can write our the individual elements of \(v\) as
Then, if we define the \(i\)-th column of \(X\) to be \(X_i\), we have the Lasso optimality condition as
Soft-thresholding#
The soft-thresholding problem is simply the Lasso problem with \(X = I\):
Then, we have the optimality condition as
Define the soft-thresholding operator \(S_\lambda(y)\) as
We can see that if we set \(\beta = S_\lambda(y)\) the optimality condition is satisfied.