Subgradients#

For a convex and differentiable function \(f\), the gradient of \(f\), denoted as \(\nabla f\), satisfies

\[ f(y) \geq f(x) + \nabla f(x)^T(y - x),\ \forall x, y. \]

For functions that are convex but not necessarily differentiable (e.g. \(f(x) = |x|\)), we can generalize the concept of gradients to subgradients. The subgradient \(g\in\mathbb{R}^n\) of a convex function \(f\) at \(x\) satisfies

(1)#\[ f(y) \geq f(x) + g^T(y - x),\ \forall y. \]

Subgradients Properties

  • The subgradient always exists within the relative interior of the domain. It does not necessarily exist at the boundary, e.g., when there is an indicator functions.

  • When \(f\) is differentiable at \(x\), then \(g = \nabla f(x)\) uniquely.

Subdifferential#

The set of all subgradients of a convex function \(f\) is called the subdifferential

\[ \partial f(x) = \{g \mid g\in\mathbb{R}^n, g\ \text{is a subgradient of}\ f\ \text{at}\ x\}. \]

Subdifferential Properties

  • \(\partial f(x)\) is always closed and convex.

  • \(\partial f(x)\) is nonempty for convex \(f\)s.

  • If \(f\) is differentiable at \(x\), then \(\partial f(x) = \{\nabla f(x)\}\).

  • If \(\partial f(x) = \{g\}\), then \(f\) is differentiable at \(x\) and \(\nabla f(x) = g\).

Subgradient Calculus#

  • Scaling: \(\partial(af) = a\partial(f)\) when \(a > 0\).

  • Addition: \(\partial(f_1 + f_2) = \partial f_1 + \partial f_2\).

  • Affine composition: If \(h(x) = f(Ax + b)\) then \(\partial h(x) = A^T\partial f(Ax + b)\).

  • Finite pointwise maximum: if \(f(x) = \max_{i = 1, \cdots, m}f_i(x)\), then the subdifferential is the convex hull of the union of the subdifferentials of all active functions \(f_i\), where the maximum is obtained, i.e., \(f_i(x) = f(x)\),

\[\partial f(x) = \mathrm{conv}\Big(\bigcup_{i: f_i(x) = f(x)}\partial f_i(x)\Big)\]
  • General pointwise maximum: if \(f(x) = \max_{s \in S}f_s(x)\), then

\[\partial f(x) = \mathrm{cl}\Big\{\mathrm{conv}\Big(\bigcup_{s: f_s(x) = f(x)}\partial f_s(x)\Big)\Big\}\]

Subdifferential of an Indicator Function#

For a convex set \(C \subseteq \mathbb{R}^n\), the indicator function on set \(C\) is defined as \(I_C: \mathbb{R}^n \rightarrow \mathbb{R}^n\):

\[\begin{split} I_C(x) = \begin{cases} 0 & \mathrm{if}\ x \in C\\ \infty & \mathrm{if}\ x \notin C \end{cases} \end{split}\]

The normal cone \(\mathcal{N}_C(x)\) of the set \(C\) at \(x\) is defined as

\[ \mathcal{N}_C(x) = \{g \mid g\in\mathbb{R}^n, g^Tx \geq g^Ty\ \text{for any}\ y \in C\}. \]

Then, we have the relationship

\[ \partial I_C(x) = \mathcal{N}_C(x). \]
Proof

From (1), we know that the subgradient of \(I_C\) satisfies the relationship

\[ I_C(y) \geq I_C(x) + g^T(y - x), \forall y \in C. \]

Since \(x, y \in C\), we have \(I_C(x) = I_C(y) = 0\). Thus, the above inequality becomes

\[ 0 \geq 0 + g^T(y - x) \rightarrow g^Tx \geq g^Ty. \]

Since any \(g\) that satisfies \(g^Tx \geq g^Ty\) will belong to \(\mathcal{N}_C(x)\), we can conclude that \(\partial I_C(x) = \mathcal{N}_C(x)\).

Subgradient Optimality Condition#

For any \(f\) (can be nonconvex), we have

\[ f(x^\star) = \min_x f(x) \iff 0 \in \partial f(x^\star) \]

In other words, \(x^\star\) is a minimizer if and only if \(0\) belongs to the subgradient of \(f\) at \(x^\star\). We can easily understand this by setting \(g = 0\) in (1), which leads to

\[ f(y) \geq f(x^\star) + 0^T(y - x^\star) = f(x^\star). \]

Using this, we can also derive a variant of the first-order optimality condition

\[ 0 \in \partial f(x) + \mathcal{N}_C(x). \]

Examples#

Lasso Optimality Conditions#

Given \(y \in \mathbb{R}^n\), \(X \in \mathbb{R}^{n \times p}\), and \(\lambda \geq 0\), the Lasso problem can be written as

\[ \min_\beta \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1. \]

Using the subgradient optimality condition, we have

\[ 0 \in \partial\Big(\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1\Big) = -X^T(y - X\beta) + \lambda\partial\|\beta\|_1. \]

Which is equivalent to saying

\[ X^T(y - X\beta) = \lambda v \]

with \(v \in \partial\|\beta\|_1\). We can write our the individual elements of \(v\) as

\[\begin{split} v_i = \begin{cases} \{1\} & \mathrm{if}\ \beta_i > 0\\ \{-1\} & \mathrm{if}\ \beta_i < 0\\ [-1, 1] & \mathrm{if}\ \beta_i = 0 \end{cases}. \end{split}\]

Then, if we define the \(i\)-th column of \(X\) to be \(X_i\), we have the Lasso optimality condition as

\[\begin{split} \begin{cases} X_i^T(y - X\beta) = \lambda\mathrm{sign}(\beta_i) & \mathrm{if}\ \beta_i \neq 0\\ |X_i^T(y - X\beta)| \leq \lambda & \mathrm{if}\ \beta_i = 0 \end{cases}. \end{split}\]

Soft-thresholding#

The soft-thresholding problem is simply the Lasso problem with \(X = I\):

\[ \min_\beta \frac{1}{2}\|y - \beta\|_2^2 + \lambda\|\beta\|_1. \]

Then, we have the optimality condition as

\[\begin{split} \begin{cases} y_i - \beta_i = \lambda\mathrm{sign}(\beta_i) & \mathrm{if}\ \beta_i \neq 0\\ |y_i - \beta_i| \leq \lambda & \mathrm{if}\ \beta_i = 0 \end{cases}. \end{split}\]

Define the soft-thresholding operator \(S_\lambda(y)\) as

\[\begin{split} [S_\lambda(y)]_i = \begin{cases} y_i - \lambda & \mathrm{if}\ y_i > \lambda\\ 0 & \mathrm{if}\ -\lambda \leq y_i \leq \lambda\\ y_i + \lambda & \mathrm{if}\ y_i < -\lambda \end{cases}. \end{split}\]

We can see that if we set \(\beta = S_\lambda(y)\) the optimality condition is satisfied.