
Proximal gradient for SVM

Posted 2017-12-22

For a class that's currently being written (ahem, EE104), Prof. Boyd posed an interesting problem of writing a (relatively general, but ideally simple) proximal-gradient optimizer. The idea is that it would act as a black-box way for students to plug in machine learning models of a specific form and have the optimizer do all of the hard work immediately, while appearing as transparent as possible.

This led to a rabbit hole of asking which loss functions should be included in the optimization package (Huber, square, $\ell_1$, etc.), most of which are relatively straightforward to implement, since computing their proximal gradient is a direct computation (e.g. $\ell_1$ corresponds immediately to a shrinkage operator, as we'll see below). The exception, of course, is SVM: its loss comes with a set of hard constraints which I haven't found a nice way of stuffing into the prox-gradient step (and I suspect that there are no such nice ways, but I'd love to be proven wrong here); thus, every step requires a projection onto a polyhedron, which is, itself, a second optimization problem that has to be solved.

Prox gradients are (generally) really well-behaved, and I've been having some fun trying to understand how well they work as general optimizers. I write up a few of those thoughts below, along with an odd solution to the original problem.

Proximal gradients

The proximal gradient is a nice idea emerging from convex analysis that provides a useful way of dealing with tricky, non-differentiable convex functions. In particular, you can think of the proximal gradient of a given function as the solution of an optimization problem that penalizes taking steps "too far" in a given direction. Better yet (and perhaps one of the main useful points), most functions we care about have relatively nice proximal gradient operators!

Anyways, for now, let's define the proximal gradient of a function $g$ at some point $x$ (usually denoted $\text{prox}_g(x)$, though I will simply call it $P_g(x)$ for shorter notation) to be

$$
P_g(x) \equiv \mathop{\arg\!\min}\limits_y \left(g(y) + \frac{1}{2}\lVert x - y\rVert_2^2\right)
$$

The definition is useful mostly because, if we let $\partial g(u)$ be the subdifferential of $g$ at $u$, then optimality guarantees that, if $y = P_g(x)$, then (since the subdifferential of the objective above must contain zero at the optimum) we have

$$
x - y \in \partial g(y).
$$

In other words, $0 \in \partial g(y)$ iff $y$ is a fixed point of $P_g$; that is, we have reached a minimizer of $g$ iff

$$
y = P_g(y).
$$

Additionally, there's no weird trickery that has to be done with subdifferentials since the result of $P_g$ is always unique (the objective defining it is strongly convex), which is a nice side-benefit. Using just this, we can already begin to do some optimization. For example, let's consider the (somewhat boring, but enlightening) example of minimizing the $\ell_1$-norm. Using the fact that

$$
u = P_{\lambda |\cdot|}(x) \iff x - u \in \lambda\, \partial |u|
$$

and using the fact that the $\ell_1$-norm is separable, we have, whenever $u > 0$ (I'm considering a single term of the sum, here)

$$
x - u = \lambda \implies u = x - \lambda\ \text{ whenever }\ x - \lambda > 0
$$

similarly for the $u = 0$ case we have (where $\lambda S$ for some set $S$ is just multiplication of every element in the set by $\lambda$)

$$
x - u = x \in \lambda [-1, 1] = [-\lambda, \lambda],
$$

that is

$$
u = 0\ \text{ whenever }\ |x| \le \lambda
$$

and similarly for the $u < 0$ case, we have $u = x + \lambda$ if $x < -\lambda$. Since this is done for each component, the final operator has action

$$
u_i = \begin{cases} x_i - \lambda, & x_i > \lambda\\ 0, & |x_i| \le \lambda \\ x_i + \lambda, & x_i < -\lambda. \end{cases}
$$

This operator is called the 'shrinkage' operator because of its action on its input: if the magnitude of $x_i$ is greater than our given $\lambda$, we shrink it toward zero by that amount (and set it to zero otherwise). Note, then, that successively applying (in the same manner as SGD) the update rule

$$
u^{i+1} = P_{|\cdot|}(u^i)
$$

correctly yields the minimum of the given convex function, i.e. $0$. Of course, this isn't particularly surprising since we already know how to optimize the $\ell_1$-norm function, $\lVert x \rVert_1$ (just set it to zero!), but it will help out quite a bit when considering more complicated functions.
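As a quick sanity check, here's a minimal Julia sketch of the shrinkage operator (the function name is mine, not from any package):

```julia
# Soft-thresholding: the prox operator of λ|·| applied elementwise, i.e. the
# closed-form solution of  min_u  λ‖u‖₁ + ½‖x - u‖₂².
soft_threshold(x, λ) = sign.(x) .* max.(abs.(x) .- λ, 0)

soft_threshold([-2.0, -0.3, 0.0, 0.4, 1.5], 0.5)
# -> [-1.5, 0.0, 0.0, 0.0, 1.0]   (iterating this drives every entry to 0)
```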

Proximal gradient update

Now, given a problem of the form

$$
\min_x\ f(x) + g(x)
$$

where $f$ is differentiable and $g$ is convex, we can write down a possible update in the same vein as the above, except that we now take a gradient step on $f$ and a prox step on $g$ at the same time:

$$
u^{i+1} = P_{\gamma^{i+1} g}\left(u^i - \gamma^{i+1}\nabla f(u^i)\right).
$$

Here, $\gamma^i$ is defined to be the step size for step $i$. It turns out we can prove several things about this update but, perhaps most importantly, we can show that it works: for example, it converges to a minimizer with a fixed step size $\gamma \le 1/L$ whenever $\nabla f$ is $L$-Lipschitz.
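In code, the whole method is only a few lines. Here's a minimal Julia sketch with a fixed step size (the names `prox_gradient`, `grad_f`, and `prox_g` are mine):

```julia
# Generic proximal gradient method for minimizing f(x) + g(x), where f is
# differentiable and g has an easy prox. The caller supplies grad_f(x) = ∇f(x)
# and prox_g(x, γ) = P_{γg}(x).
function prox_gradient(grad_f, prox_g, x0; γ = 1e-2, iters = 1000)
    x = copy(x0)
    for _ in 1:iters
        x = prox_g(x - γ * grad_f(x), γ)
    end
    return x
end

# e.g. a tiny lasso-style problem, reusing soft_threshold from above:
A, b, λ = randn(20, 5), randn(20), 0.1
xhat = prox_gradient(x -> A' * (A * x - b), (x, γ) -> soft_threshold(x, γ * λ), zeros(5))
```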

Anyways, this is all I'll say about the proximal gradient step as there are several good resources on the proximal gradient method around which will do a much better job of explaining it than I probably ever will: see this for example.

Optimizing SVM using proximal gradient

Initial problem

I assume some familiarity with SVMs, but the program given might require a bit of explanation. The idea of an SVM is as a soft-margin classifier (there are hard-margin SVMs, but we'll consider the soft-margin variety for now): we penalize the error of being on the wrong side of the decision boundary linearly, with zero penalty for being on the correct side of the decision boundary. The only additional thing is that we also penalize the norm of the separating hyperplane's coefficients so that the classifier doesn't depend overly on any particular variable (e.g. as a form of regularization).

The usual quadratic program for an SVM is the following, where $\xi^\pm_i$ are the slack variables indicating how much the given margin is violated, $\varepsilon > 0$ is some arbitrary positive constant, and $\mu$ contains the hyperplane coefficients and constant offset found by the SVM (e.g. by letting the first feature of every positive/negative sample be $(x^\pm_i)_0 = 1$):

$$
\begin{aligned} & \underset{\xi, \mu}{\text{minimize}} & & \sum_i \xi^+_i + \sum_i \xi^-_i + C\lVert \mu \rVert_2 \\ & \text{subject to} & & \mu^Tx^+_i - \varepsilon \ge -\xi^+_i,\,\,\text{ for all } i \\ &&& \mu^Tx^-_i + \varepsilon \le \xi^-_i,\,\,\text{ for all } i\\ &&& \xi^\pm_i \ge 0,\,\,\text{ for all } i \end{aligned}
$$

We can rewrite this immediately, letting $y^i \in \{-1, +1\}$ be the class of data point $i$ (with $y^i = +1$ for positive samples and $y^i = -1$ for negative ones), in a much nicer form

$$
\begin{aligned} & \underset{\xi, \mu}{\text{minimize}} & & 1^T \xi + C\lVert \mu \rVert_2 \\ & \text{subject to} & & y^i \mu^Tx_i - \varepsilon \ge -\xi_i,\,\,\text{ for all } i \\ &&& \xi_i \ge 0,\,\,\text{ for all } i \end{aligned}
$$

and noting that the objective is homogeneous of degree one, we can just scale the constraints and all variables by $\frac{1}{\varepsilon}$, which yields the immediate result (after rearranging the first constraint)

$$
\begin{aligned} & \underset{\xi, \mu}{\text{minimize}} & & 1^T\xi + C\lVert \mu \rVert_2 \\ & \text{subject to} & & \xi_i \ge 1 - y^i \mu^Tx_i,\,\,\text{ for all } i \\ &&& \xi_i \ge 0,\,\,\text{ for all } i \end{aligned}
$$

which, after changing the $\ell_2$ regularizer to a squared $\ell_2$ regularizer (which is equivalent for an appropriate choice of the regularization hyperparameter, say $\eta$[1]), yields

$$
\begin{aligned} & \underset{\xi, \mu}{\text{minimize}} & & 1^T \xi + C\lVert \mu \rVert_2^2 \\ & \text{subject to} & & \xi_i \ge 1 - y^i \mu^Tx_i,\,\,\text{ for all } i \\ &&& \xi_i \ge 0,\,\,\text{ for all } i. \end{aligned}
$$

This is the final program we care about and the one we have to solve using our proximal gradient operator. In general, it's not obvious how to fit the inequalities into a step, so we have to define a few more things.

Set indicators

For now, let's define the set indicator function

$$
g_S(x) = \begin{cases} 0 & x\in S\\ +\infty & x\not\in S \end{cases}
$$

which is convex whenever $S$ is convex; we can use this to encode the above constraints, taking $S$ to be the set of $(\mu, \xi)$ that satisfy them (I drop the $S$ for convenience in what follows), so that a program equivalent to the above is

$$
\underset{\xi, \mu}{\text{minimize}} \,\, 1^T \xi + C\lVert \mu \rVert_2^2 + g(\mu, \xi)
$$

which is exactly what we wanted! Why? Well:

$$
\underset{\xi, \mu}{\text{minimize}}\,\, \underbrace{1^T \xi + C\lVert \mu \rVert_2^2}_\text{differentiable} + \underbrace{g(\mu, \xi)}_\text{convex}
$$

so now, we just need to find the proximal gradient operator for $g$ (which is not as nice as one would immediately think, but it's not bad!).

Proximal gradient of the linear inequality indicator

Now, let's generalize the problem a bit: we're tasked with finding the prox-gradient of $g_S(x)$ where $S$ is given by some set of inequalities, $S = \{x \mid Ax \le b\}$, for some given $A, b$.[2] That is, we require

$$
\mathop{\arg\!\min}\limits_y\ \frac{1}{2\lambda}\lVert x - y\rVert_2^2 + g_S(y)
$$

which can be rewritten as the equivalent program (the factor of $1/\lambda$ is dropped since it's just a proportionality constant; I keep the $1/2$ since it makes the algebra below slightly cleaner)

$$
\begin{aligned} & \underset{y}{\text{minimize}} & & \tfrac{1}{2}\lVert x - y\rVert_2^2 \\ & \text{subject to} & & Ay\le b \end{aligned}
$$

It turns out this program (which is just the Euclidean projection of $x$ onto the polyhedron $\{y \mid Ay \le b\}$) isn't itself nicely solvable with a prox-gradient step, since there's no obvious closed form for this projection. But, of course, I wouldn't be writing this if there weren't a cute trick or two we could do: note that this program has a strong dual (i.e. the values of the dual program and the primal are equal) by Slater's condition, so how about trying to solve the dual program? The Lagrangian is

$$
\mathcal{L} = \tfrac{1}{2}\lVert x - y\rVert_2^2 + \gamma^T(Ay - b)
$$

from which we can derive the dual by taking derivatives over $y$:

$$
\nabla_y\mathcal{L} = y - x + A^T\gamma = 0 \implies y = x - A^T\gamma
$$

and plugging the above back into the Lagrangian and simplifying,
$$
\tfrac{1}{2}\lVert A^T\gamma\rVert_2^2 + \gamma^T\left(A(x - A^T\gamma) - b\right) = \gamma^T(Ax - b) - \tfrac{1}{2}\lVert A^T\gamma\rVert_2^2,
$$
yields the program (renaming the dual variable $\gamma$ to $\eta$)

$$
\begin{aligned} & \underset{\eta}{\text{maximize}} & & \eta^T(Ax-b) - \frac{1}{2}\lVert A^T\eta\rVert_2^2 \\ & \text{subject to} & & \eta \ge 0 \end{aligned}
$$

from which we can reconstruct the solution of the original problem, via the relation above, as

$$
y = x - A^T\eta.
$$

This program now has a nice prox step, since the prox of the indicator of $\{\eta \mid \eta \ge 0\}$ is simply $\left(P_{\eta \ge 0}(\eta)\right)_i = \max\{0, \eta_i\}$ (the 'positive part' of $\eta_i$, in other words); deriving this is left as an exercise for the reader.
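Here is a minimal Julia sketch of this projection, solving the dual with exactly this kind of prox-gradient (projected gradient) loop; the function name and the step-size choice are my own:

```julia
using LinearAlgebra

# Project x onto {y | A*y ≤ b} by (approximately) solving the dual
#   maximize_η  η'(A*x - b) - ½‖A'η‖²   subject to  η ≥ 0
# with projected gradient ascent, then recovering y = x - A'η.
function project_polyhedron(A, b, x; iters = 5000)
    η = zeros(size(A, 1))
    γ = 1 / (opnorm(A)^2 + 1e-12)         # safe step size: 1 / λ_max(A*A')
    for _ in 1:iters
        grad = (A * x - b) - A * (A' * η) # gradient of the dual objective
        η = max.(η .+ γ .* grad, 0)       # prox step: project onto η ≥ 0
    end
    return x - A' * η
end
```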

Final Thoughts

Putting the above together yields a complete way of optimizing the SVM program: take a gradient step on the differentiable part of the objective, project the result back onto the polyhedron defined by the inequality constraints (by solving the dual program above), and repeat.
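To make that concrete, here's a rough, hypothetical end-to-end sketch that reuses `prox_gradient` and `project_polyhedron` from above; the variable packing, names, and step sizes are all my own choices rather than a definitive implementation:

```julia
using LinearAlgebra

# X is an m×n data matrix whose first column is all ones (to absorb the
# offset), y ∈ {-1,+1}^m are the labels, and C is the regularization weight.
function svm_prox(X, y, C; iters = 200)
    m, n = size(X)
    # Stack the constraints ξᵢ ≥ 1 - yᵢ μᵀxᵢ and ξᵢ ≥ 0 as A * [μ; ξ] ≤ b.
    G     = -(y .* X)
    negId = -Matrix{Float64}(I, m, m)
    A = [ G           negId
          zeros(m, n) negId ]
    b = [ -ones(m); zeros(m) ]
    grad_f(z)    = [ 2C .* z[1:n]; ones(m) ]       # gradient of C‖μ‖₂² + 1ᵀξ
    prox_g(z, γ) = project_polyhedron(A, b, z)     # the projection ignores γ
    z = prox_gradient(grad_f, prox_g, zeros(n + m); γ = 1e-2, iters = iters)
    return z[1:n]                                  # the separating hyperplane μ
end
```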

Of course, this serves as little more than an academic exercise (I'm not sure how thrilled either Boyd or Lall would be at using dual programs in 104, even in a quasi-black-box optimizer), but it may be of interest to people who take interest in relatively uninteresting things (e.g. me) or to people who happen to have really freaking fast proximal gradient optimizers (not sure these exist, but we can pretend).

[1] We're also finding this hyperparameter by cross-validation, rather than a priori, so the exact correspondence doesn't matter too much.
[2] Note that this is equivalent to the original problem, actually. The forwards direction is an immediate generalization. The backwards direction is a little more difficult, but can be made simple by considering, say, only positive examples.
