Longest Path Search

PCA as a convex optimization problem

Posted 2018-05-16

It's been a while since I last posted (my posting has been less like once every two weeks and more like once every two months), but here's a post I've been sitting on for a while that I never got around to finishing. As per rachelbythebay's advice, I decided to just finish it and post it up. It's likely to be a little rough, but feel free to tweet any questions or things that you'd like more fleshed out (as usual!).

Quick introduction to PCA

Most people know Principal Component Analysis (PCA) as a fast, easily-scalable dimensionality-reduction technique used quite frequently in machine learning and data exploration—in fact, it's often mentioned that a one-layer, linear neural network[1] applied to some dataset recovers the result of PCA.

It's (also) often mentioned that PCA is one of the few non-convex problems that we can solve efficiently, though a (let's say 'non-constructive') answer showing this problem is convex is given in this Stats.SE thread, which requires knowing the eigenvectors of $X^TX$, a priori. It turns out it's possible to create a fairly natural semi-definite program which actually constructs the solution in its entirety.

Since I'll only give a short overview of PCA itself, I won't go into much depth on methods of solving it. But the general idea of PCA is to find the best low-rank approximation of a given matrix $A$. In other words, we want, for some given $k$:

$$
\begin{aligned}
& \underset{X}{\text{minimize}} & & \| A - X \|_F^2 \\
& \text{subject to} & & \text{rank}(X) = k,
\end{aligned}
$$

where $\| B \|_F^2$ is the square of the Frobenius norm of $B$ (i.e. it is the sum of the squares of each entry of $B$). Why is this useful? Well, in the general formulation, we can write the SVD of some optimal $X^* \in \mathbb{R}^{m\times n}$,

$$
X^* = U^*\Sigma^* (V^*)^T
$$

with orthogonal $U^* \in \mathbb{R}^{m\times k}, V^*\in \mathbb{R}^{n\times k}$ and diagonal $\Sigma^* \in \mathbb{R}^{k\times k}$. Then the columns of $V^*$ represent the $k$ most important features of $A$ (assuming that each row of $A$ is a point of the dataset). This may seem slightly redundant if you already know the punchline, but we'll get there in a second.

For now, define the SVD of $A$ in a similar way to the above,

$$
A = U\Sigma V^T,
$$

with orthogonal $U, V$ and diagonal $\Sigma$.

For convenience, it's easiest to define the diagonal of $\Sigma$ (the singular values of $A$) to be sorted, with the top-left value being the largest and the bottom-right value being the smallest. Then let $U_k$ be the matrix which contains only the first $k$ columns of $U$ (and similarly for $V_k$), while $\Sigma_k$ is the $k$ by $k$ diagonal sub-matrix of $\Sigma$ containing only the first $k$ values of the diagonal (as usual, starting from the top left).

Now we can get to the punchline I was talking about earlier: it turns out that the SVD of $X^*$ is the truncated SVD of $A$. In other words, if the SVD of $A$ is $U\Sigma V^T$, then the optimal solution is

$$
X^* = U_k\Sigma_kV_k^T.
$$

This is the usual way of computing the PCA decomposition of $A$: simply take the SVD and then look at the first $k$ columns of $V$.[2] We'll make use of this fact to show that the optimal values are equal, but it won't be necessary to actually compute the result.
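For concreteness, here's a minimal numpy sketch of that recipe (the toy data matrix and the choice of $k$ below are just made-up placeholders):

```python
import numpy as np

# Toy data matrix (hypothetical): 100 points with 10 features each.
A = np.random.randn(100, 10)
k = 3

# Full SVD of A; numpy returns the singular values in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncate to the top k singular triplets.
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k].T

# Best rank-k approximation of A, and the k principal directions.
X_star = U_k @ S_k @ V_k.T
principal_directions = V_k  # columns are the first k columns of V
```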

Construction of the SDP

In general, semi-definite programs (i.e. optimization problems over symmetric, positive semi-definite matrices with convex objectives and constraints) are convex problems. Here, we'll construct a (relatively) simple reduction of the non-convex PCA problem, as presented above, to an SDP.

Quick overview of method

This entire idea was interesting to me, since it was mentioned in this lecture and was a result I didn't know about. There aren't any complete proofs of this online, other than a quick mention in Vu et al. (2013), though it's not hard to show the final result given the general ideas. I highly encourage you to try the proof out after reading only the main ideas, if you're interested!

First, we'll start with the usual program, call it program Y:

$$
\begin{aligned}
& \underset{X}{\text{minimize}} & & \| A - X \|_F^2 \\
& \text{subject to} & & \text{rank}(X) = k,
\end{aligned}
$$

and construct the equivalent program (this step can be skipped with a cute trick below), with $F = A^TA$,

$$
\begin{aligned}
& \underset{P}{\text{minimize}} & & \| F - P \|_F^2 \\
& \text{subject to} & & \text{rank}(P) = k,\\
&&& P^2 = P,\\
&&& P^T = P,
\end{aligned}
$$

in other words, this is a program over projection matrices $P$. This can then be put into the form

$$
\begin{aligned}
& \underset{P}{\text{maximize}} & & \text{tr}(FP) \\
& \text{subject to} & & \text{rank}(P) = k,\\
&&& P^2 = P,\\
&&& P^T = P.
\end{aligned}
$$
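To see why (a quick sketch of this step): any feasible $P$ satisfies $P^TP = P^2 = P$ and $\text{tr}(P) = \text{rank}(P) = k$, so

$$
\| F - P \|_F^2 = \text{tr}\big((F - P)^T(F - P)\big) = \| F \|_F^2 - 2\,\text{tr}(FP) + k,
$$

and, since the first and last terms don't depend on $P$, minimizing the left-hand side is the same as maximizing $\text{tr}(FP)$.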

This problem can then be relaxed into the following SDP, let's call it problem Z,

$$
\begin{aligned}
& \underset{P}{\text{maximize}} & & \text{tr}(FP) \\
& \text{subject to} & & \text{tr}(P) = k,\\
&&& 0 \preceq P \preceq I,
\end{aligned}
$$

where $A \preceq B$ is an inequality with respect to the semi-definite cone (i.e. $A \preceq B \iff B - A$ is positive semi-definite). You can then show that this SDP has zero integrality gap with the above program over the projection matrices. More specifically, any solution to the relaxation can be easily turned into a solution of the original program.
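As a sanity check, here's a minimal sketch of problem Z in cvxpy (the data matrix, and therefore $F$, is a made-up placeholder); for generic data its solution should already match the truncated-SVD projection $V_kV_k^T$, up to solver tolerance:

```python
import numpy as np
import cvxpy as cp

np.random.seed(0)
A = np.random.randn(30, 6)  # toy data matrix (hypothetical)
k = 2
F = A.T @ A
n = F.shape[0]

# Problem Z: maximize tr(FP) subject to tr(P) = k and 0 <= P <= I.
P = cp.Variable((n, n), symmetric=True)
constraints = [cp.trace(P) == k, P >> 0, np.eye(n) - P >> 0]
problem = cp.Problem(cp.Maximize(cp.trace(F @ P)), constraints)
problem.solve()

# Compare against the truncated-SVD projection V_k V_k^T.
_, _, Vt = np.linalg.svd(A)
P_svd = Vt[:k].T @ Vt[:k]
print(np.linalg.norm(P.value - P_svd))  # small, up to solver tolerance
```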

Just a random side-note: if you took or followed Stanford's EE364A course for the previous quarter (Winter 2018), the latter part of this proof idea may seem familiar—it was a problem written for the final exam. My original intent with it was to guide the students through the complete proof, but better judgement prevailed and the question was cut down to only that last part with some hints.

Complete (if somewhat short!) steps

The two interesting points of the whole proof are (a) to realize that any solution of the original problem (program Y) can be written as $X = AP'$ for some projection matrix $P'$ (which, of course, will turn out to be the projection matrix that solves program Z, namely $P' = V_kV_k^T$), and (b) to note that we can prove program Z has zero integrality gap since, if we have a solution to the SDP given by $P^* = UDU^T$, then we can 'fix' any non-integral eigenvalues by solving the problem

$$
\begin{aligned}
& \underset{x}{\text{maximize}} & & c^Tx \\
& \text{subject to} & & 1^Tx = k,\\
&&& 0 \le x_i \le 1, ~~\forall i,
\end{aligned}
$$

where $c_i = (U^TFU)_{ii}$. This LP has an integral solution $x^*$ (what should this solution be?) which preserves the objective value of the original problem, so $\bar P^* = U\,\text{diag}(x^*)\,U^T$ is a feasible, integral solution to the original problem with the same objective value as before, so the SDP relaxation is tight!
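Here's a minimal sketch of that rounding step (the function name is just for illustration); given the optimal $P^*$ from the SDP above and $F = A^TA$, it returns an honest rank-$k$ projection with the same objective value:

```python
import numpy as np

def round_to_projection(P_star, F, k):
    """Round an SDP solution P* (tr(P*) = k, 0 <= P* <= I) to a rank-k
    projection matrix without decreasing the objective tr(F P)."""
    # Eigendecomposition P* = U diag(d) U^T.
    d, U = np.linalg.eigh(P_star)
    # LP objective: c_i = (U^T F U)_{ii}. Over the box {0 <= x <= 1,
    # 1^T x = k}, c^T x is maximized at the 0/1 vertex that puts weight
    # one on the k largest entries of c.
    c = np.diag(U.T @ F @ U)
    x = np.zeros_like(c)
    x[np.argsort(c)[-k:]] = 1.0
    return U @ np.diag(x) @ U.T
```

Given the rounded projection $P$, the low-rank approximation itself can then be recovered as $X^* = AP$, as in point (a) above.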

Using all of this, we've converted the PCA problem into a purely convex one, without computing the actual solution beforehand.

[1] More specifically, a one-layer, linear NN with $\ell_2$ loss.
[2] As usual, there are smarter ways of doing this. It turns out one can run a truncated or partial SVD, which doesn't require constructing all of the singular values and all of the columns of $U, V$. This is far more efficient whenever $k \ll \min\{m, n\}$, where $m, n$ are the dimensions of the data; this latter condition is usually the case for practical purposes.
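For example, a minimal sketch of this using SciPy's partial SVD, on a made-up data matrix:

```python
import numpy as np
from scipy.sparse.linalg import svds

A = np.random.randn(10000, 300)  # toy data matrix (hypothetical)
k = 5

# Computes only the k largest singular triplets; note that svds returns
# the singular values in increasing order.
U_k, s_k, Vt_k = svds(A, k=k)
```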
