(Or focusing on papers. We'll find out next.)

This post will focus on a particular, nearly silly, proof of a lower bound for the distance of an unbiased random walk, defined as

\[ X = \sum_{i=1}^n X_i, \]where the \(X_i\) are drawn independently and uniformly from \(\{\pm 1\}\). The quantity we want to find a lower bound to is

\[ \mathbf{E}[|X|], \]as \(n\) is large. We know from a basic, if somewhat annoying, counting argument that

\[ \mathbf{E}[|X|] \sim \sqrt{\frac{2}{\pi}}\sqrt{n}, \]when \(n \gg 1\). In general, we're interested in bounds of the form
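As a quick numerical sanity check (a sketch, not part of the argument; the function name and parameters below are ours, just for illustration), the asymptotic above is easy to see by simulation using only Python's standard library:

```python
import math
import random

def mean_abs_walk(n: int, trials: int, seed: int = 0) -> float:
    """Estimate E[|X|] for X = sum of n independent +/-1 steps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # k heads out of n fair coin flips gives X = 2k - n.
        k = bin(rng.getrandbits(n)).count("1")
        total += abs(2 * k - n)
    return total / trials

n = 10_000
estimate = mean_abs_walk(n, trials=5000)
prediction = math.sqrt(2 / math.pi) * math.sqrt(n)
print(estimate, prediction)
```

With these settings the two printed numbers agree to within a few percent.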

\[ \mathbf{E}[|X|] \ge \Omega(\sqrt{n}). \]Bounds like these are applicable in a number of important lower bounds for online convex optimization (see, *e.g.*, Hazan's lovely overview, section 3.2) though we won't be talking too much about the applications on this one.

Additionally, since \(\mathbf{E}[X^2] = n\) (which follows by expanding and using the fact that the \(X_i\) are independent with mean zero and \(X_i^2 = 1\)), we have

\[ \mathbf{E}[|X|] \le \sqrt{\mathbf{E}[X^2]} = \sqrt{n}, \]so we know that this bound is tight up to a constant. The first inequality here follows from an application of Jensen's inequality to the square root function (which is concave).

Why bother with another proof? Mostly because I'm bad at counting and always end up with a hilarious number of errors. Plus, this proof generalizes easily to a number of other similar results!

One simple method for lower-bounding the expectation of a variable like \(|X|\) is to note that \(|X|\) is nonnegative, so we have the following 'silly' bound

\[\mathbf{E}[|X|] \ge \mathbf{E}[a\mathbf{1}_{|X| \ge a}] = a \mathbf{Pr}(|X| \ge a), \]for any \(a \ge 0\), where \(\mathbf{1}_{|X| \ge a}\) is the indicator function for the event \(|X| \ge a\), that is 1 if \(|X| \ge a\) and zero otherwise. (The bound follows from the fact that \(|X| \ge a \mathbf{1}_{|X|\ge a}\) pointwise.) Maximizing over \(a\), assuming we have a somewhat tight lower bound on the probability that \(|X| \ge a\), this approach might give us a reasonable lower bound.

In a very general sense, we want to show that \(|X|\) is 'anticoncentrated'; *i.e.*, it is reasonably 'spread out', which would indicate that its expectation cannot be too small, since it is nonnegative.

The first idea (or, at least, my first idea) would be to note that, since \(\mathbf{E}[X^2]\) is on the order of \(n\), then maybe we can use this fact to construct a bound for \(\mathbf{E}[|X|]\) which 'should be' on the order of \(\sqrt{n}\) assuming some niceness conditions, for example, that \(|X| \le n\) is a bounded variable.

Unfortunately, just these two simple facts are not enough to prove the claim! We can construct a nonnegative random variable \(Y\ge 0\) such that its second moment is \(\mathbf{E}[Y^2] = n\), it is bounded by \(Y \le n\), yet \(\mathbf{E}[Y] = 1\). In other words, we wish to construct a variable that is very concentrated around \(0\), with 'sharp' peaks at larger values.

Of course, the simplest example would be to take \(Y = n\) with probability \(1/n\) and \(Y=0\) with probability \(1-1/n\). Clearly, this variable is bounded, and has \(n\) as its second moment. On the other hand,

\[ \mathbf{E}[Y] = (1/n)n + (1-1/n)0 = 1, \]which means that the best bound we can hope for, using just these conditions (nonnegativity, boundedness, and second moment bound) on a variable, is a constant. (Indeed, applying a basic argument, we find that this is the smallest expectation possible.)

This suggests that we need a little more control over the tails of \(|X|\), which gets us to...

Another easy quantity to compute in this case is \(\mathbf{E}[X^4]\). (And, really, any even power of \(X\) is easy. On the other hand, since \(X\) has a distribution that is symmetric around 0, all odd moments are 0.) Splitting the sum out into each of the possible quartic terms, we find that any term containing an odd power of \(X_i\) will be zero in expectation as the \(X_i\) are independent. So, we find

\[ \mathbf{E}[X^4] = \sum_{i} \mathbf{E}[X_i^4] + 3\sum_{i\ne j} \mathbf{E}[X_i^2X_j^2] = n + 3n(n-1) = 3n^2 - 2n \le 3n^2. \](The factor of 3 counts the three ways of pairing four indices into two equal pairs, while the sum over \(i \ne j\) ranges over ordered pairs.) This quantity will come in handy soon.
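These moment computations are easy to sanity-check by brute force; the snippet below (pure Python, exact arithmetic) enumerates all \(2^n\) sign patterns for small \(n\) and prints \(\mathbf{E}[X^2]\) and \(\mathbf{E}[X^4]\):

```python
from fractions import Fraction
from itertools import product

def exact_moment(n: int, p: int) -> Fraction:
    """E[X^p] for the n-step walk, by enumerating all 2^n sign patterns."""
    return sum(Fraction(sum(s) ** p, 2 ** n) for s in product((-1, 1), repeat=n))

for n in range(1, 9):
    print(n, exact_moment(n, 2), exact_moment(n, 4))
```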

We can, on the other hand, split up the expectation of \(X^2\) in a variety of ways. One is particularly handy to get a tail *lower bound* like the one we wanted in our proof idea (above):

\[ \mathbf{E}[X^2] = \mathbf{E}[X^2\mathbf{1}_{|X| < a}] + \mathbf{E}[X^2\mathbf{1}_{|X| \ge a}] \le a^2 + \mathbf{E}[X^2\mathbf{1}_{|X| \ge a}]. \]The latter term can be upper bounded using Cauchy–Schwarz,^{[1]}

\[ \mathbf{E}[X^2\mathbf{1}_{|X| \ge a}] \le \sqrt{\mathbf{E}[X^4]}\sqrt{\mathbf{E}[\mathbf{1}_{|X| \ge a}^2]} = \sqrt{\mathbf{E}[X^4]}\sqrt{\mathbf{E}[\mathbf{1}_{|X| \ge a}]}. \](Since \(\mathbf{1}_{|X| \ge a}^2 = \mathbf{1}_{|X| \ge a}\).) And, since \(\mathbf{E}[\mathbf{1}_{|X| \ge a}] = \mathbf{Pr}(|X| \ge a)\), we finally have:

\[ \mathbf{E}[X^2] \le a^2 + \sqrt{\mathbf{E}[X^4]}\sqrt{\mathbf{Pr}(|X| \ge a)}. \]Rearranging gives us the desired lower bound,

\[ \mathbf{Pr}(|X| \ge a) \ge \frac{(\mathbf{E}[X^2] - a^2)^2}{\mathbf{E}[X^4]}, \]valid whenever \(a^2 \le \mathbf{E}[X^2]\). (This is a Paley–Zygmund-style bound, except over \(X^2\) rather than nonnegative \(X\).)

Now, since we know that

\[ \mathbf{E}[|X|] \ge a \mathbf{Pr}(|X| \ge a), \]then we have

\[ \mathbf{E}[|X|] \ge a \frac{(\mathbf{E}[X^2] - a^2)^2}{\mathbf{E}[X^4]}. \]Parametrizing \(a\) by \(a = \alpha\sqrt{\mathbf{E}[X^2]}\) for some \(0 \le \alpha \le 1\), we then have

\[ \mathbf{E}[|X|] \ge \alpha(1-\alpha^2)^2\frac{\mathbf{E}[X^2]^{5/2}}{\mathbf{E}[X^4]}. \]The right-hand side is maximized at \(\alpha = 1/\sqrt{5}\), which gives the following lower bound

\[ \mathbf{E}[|X|] \ge \frac{16}{25\sqrt{5}}\frac{\mathbf{E}[X^2]^{5/2}}{\mathbf{E}[X^4]}. \]And, finally, using the fact that \(\mathbf{E}[X^2] = n\) and \(\mathbf{E}[X^4] = 3n^2 - 2n \le 3n^2\), we get the final result:

\[ \mathbf{E}[|X|] \ge \frac{16}{25\sqrt{5}}\frac{n^{5/2}}{3n^2 - 2n} \ge \frac{16}{75\sqrt{5}}\sqrt{n} = \Omega(\sqrt{n}), \]as required, with no need for combinatorics! Of course the factor of \(16/(75\sqrt{5}) \approx 0.095\) is rather weak compared to the factor of \(\sqrt{2/\pi} \approx 0.80\), but this is fine for our purposes.
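Since the bound should sit below the true value, here is a quick exact check (standard library only; the helper below is ours, just for illustration), computing the moments of the walk from its binomial distribution and the best bound of the form \(a(\mathbf{E}[X^2]-a^2)^2/\mathbf{E}[X^4]\) over a grid of \(a\):

```python
import math

def walk_moment(n: int, f) -> float:
    """E[f(X)] for the n-step walk, via its binomial distribution."""
    return sum(f(2 * k - n) * math.comb(n, k) for k in range(n + 1)) / 2**n

n = 2000
m1 = walk_moment(n, abs)             # E[|X|]
m2 = walk_moment(n, lambda x: x**2)  # E[X^2] = n
m4 = walk_moment(n, lambda x: x**4)  # E[X^4]

# Best bound of the form a * (E[X^2] - a^2)^2 / E[X^4] over a grid of a.
bound = max(a * (m2 - a * a) ** 2 / m4
            for a in [i * math.sqrt(m2) / 1000 for i in range(1001)])
print(bound, m1, math.sqrt(2 / math.pi) * math.sqrt(n))
```

The printed bound is indeed below the exact expectation, which in turn sits very close to the asymptotic value.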

Of course, similar constructions also work rather nicely for things like uniform \([-1, 1]\) variables, or Normally distributed, mean-zero variables. Any variable whose second and fourth moments are easy to compute admits a lower bound of this form. (Expectations of the absolute value of sums of independent copies of these variables can be handled similarly.) These have no obvious combinatorial analogue, so the counting techniques cannot be easily generalized, whereas this bound applies immediately.

[1] | Possibly the most elegant proof of Cauchy–Schwarz I know is based on minimizing a quadratic, and goes a little like this. Note that \(\mathbf{E}[(X - tY)^2]\ge 0\) for any \(t \in \mathbf{R}\). (That this expectation exists can be shown for any \(t\) assuming both \(X\) and \(Y\) have finite second moment. If not, the inequality is also trivial.) Expanding gives \(\mathbf{E}[X^2] - 2t\mathbf{E}[XY] + t^2\mathbf{E}[Y^2] \ge 0\). Minimizing the left hand side over \(t\) then shows that \(t^\star = \mathbf{E}[XY]/\mathbf{E}[Y^2]\), which gives
\[ \mathbf{E}[X^2] - \frac{\mathbf{E}[XY]^2}{\mathbf{E}[Y^2]} \ge 0.\]
Multiplying both sides by \(\mathbf{E}[Y^2]\) gives the final result. |

One specific case in which this question is useful is in the field-bounds paragraph on pages 18 and 19 of the paper, though more generally this question can also help answer a number of other important results (which we do not mention here).

Unfortunately, this post will end in a bit of a disappointing note: the result given here depends on some condition which is likely not easy to check in practice. On the other hand, it does lead to a suggestive definition I have never seen before of an "elementwise nonexpansive operator." I would be quite curious to see if anyone had any references!

In this case, we will again focus on the *diagonal physics equation*. Here, the physics equation is:

\[ (A + \mathbf{diag}(\theta))z = b, \]where \(\theta \in \mathbf{R}^n\) are the design parameters (usually the permittivities in many problems) while \(z \in \mathbf{R}^n\) are the fields. In general, we are only allowed to choose parameters within a certain range, so we will write this as

\[ -1 \le \theta \le 1, \]without loss of generality. (In particular, if \(\theta\) is constrained to lie within any range, we can always rescale the physics equation to make \(\theta\) lie between \(-1\) and \(1\). For more details on how to do this, see section 1.2 of the paper.)

Taking this formulation, we can "eliminate" the design parameter. In other words, we will write a number of equations, depending only on the variable \(z\), such that, when \(z\) satisfies all the equations, there exists some design \(\theta \in [-1, 1]^n\) which makes the physics equation true.

To do this, note that we can take the initial physics equation and rearrange it as follows:

\[ Az - b = -\mathbf{diag}(\theta)z. \]Taking the elementwise absolute value of both sides gives

\[ |Az - b| = |\mathbf{diag}(\theta)z|. \](Here, we interpret \(|\cdot|\) to be elementwise.) Because \(|\theta| \le 1\), we can see that

\[ |\mathbf{diag}(\theta)z| \le |z|, \]so

\[ |Az - b| \le |z|. \]In fact, given \(z\), the inequality \(|Az - b| \le |z|\) holds if, and only if, there exists some \(\theta\) with \(|\theta| \le 1\) satisfying \(-\mathbf{diag}(\theta)z = Az - b\): simply take \(\theta_i = -(Az-b)_i/z_i\) whenever \(z_i \ne 0\), and \(\theta_i = 0\) otherwise. Meaning the inequality we just derived, depending only on \(z\), is true if, and only if, there exists some design \(\theta\) satisfying \(|\theta| \le 1\) that makes the original physics equation true.
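Here is a small numeric sketch of this equivalence in pure Python (the instance is made up: random \(A\), \(\theta\), and \(z\)); given any \(z\) with \(|Az - b| \le |z|\), a feasible design is recovered elementwise via \(\theta_i = -(Az - b)_i/z_i\) wherever \(z_i \ne 0\):

```python
import random

def matvec(A, z):
    return [sum(a_ij * z_j for a_ij, z_j in zip(row, z)) for row in A]

rng = random.Random(1)
n = 5

# Build a feasible instance: pick A, theta in [-1, 1]^n, and fields z,
# then define b so that (A + diag(theta)) z = b holds exactly.
A = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
theta = [rng.uniform(-1, 1) for _ in range(n)]
z = [rng.uniform(0.5, 2) for _ in range(n)]  # keep z_i nonzero
b = [az_i + th_i * z_i for az_i, th_i, z_i in zip(matvec(A, z), theta, z)]

# The eliminated condition |Az - b| <= |z| holds elementwise...
resid = [az_i - b_i for az_i, b_i in zip(matvec(A, z), b)]
assert all(abs(r_i) <= abs(z_i) + 1e-9 for r_i, z_i in zip(resid, z))

# ...and a feasible design is recovered elementwise.
theta_rec = [-r_i / z_i for r_i, z_i in zip(resid, z)]
print(max(abs(t - tr) for t, tr in zip(theta, theta_rec)))
```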

The nice part about this equation is that it encapsulates all of the important parts of the problem in a simple-to-reason-about format. (It also suggests some interesting heuristics, but that's for another time!)

Now, finally, to answer the question! At least partially.

The proof technique here is relatively simple: we will

I've used this trick before in a few other papers, with the main example being a paper coauthored with Kunal Shah and Mac Schwager, found here, specifically in equation (8) and below, starting on page 10.

The basic (very general!) idea is to replace a "min-max" optimization problem with a "min" optimization problem. For example, say we are given the following optimization problem

\[ \begin{aligned} & \text{minimize} && \max_{g(y) \le 0} f(x, y)\\ & \text{subject to} && h(x) \le 0, \end{aligned} \]with variables \(x \in \mathbb{R}^m\) and \(y \in \mathbb{R}^n\) and functions \(f : \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}\), \(g:\mathbb{R}^n \to \mathbb{R}\), and \(h: \mathbb{R}^m \to \mathbb{R}\). Now, consider the usual Lagrangian of the "inner problem" (the one with the max over \(y\)), which we know is

\[ L(x, y,\lambda) = f(x, y) - \lambda g(y). \]If we define

\[ \bar f(x, \lambda) = \sup_{y} L(x, y, \lambda), \]then, for any \(\lambda \ge 0\) and any feasible \(y\) (*i.e.*, \(y\) that satisfies \(g(y) \le 0\)), we have that

\[ \bar f(x, \lambda) \ge L(x, y, \lambda) = f(x, y) - \lambda g(y) \ge f(x, y). \](The first inequality follows from the definition of \(\sup\) while the last inequality follows since \(\lambda \ge 0\) and \(g(y) \le 0\), which means that \(\lambda g(y) \le 0\).) So, for any \(x\), we know that, if you give me any \(\lambda \ge 0\), then \(\bar f(x, \lambda)\) is an "overestimator" of \(f(x, y)\) for any feasible \(y\).

But, since this is true for any \(\lambda \ge 0\) and \(x\), then certainly

\[ \inf_{\lambda \ge 0} \bar f(x, \lambda) \ge \sup_{g(y) \le 0} f(x, y), \]for any \(x\).^{[2]}

If we use our new overestimator \(\bar f\) instead of \(f\), our new problem is now a simple optimization problem

\[ \begin{aligned} & \text{minimize} && \bar f(x, \lambda)\\ & \text{subject to} && h(x) \le 0\\ &&& \lambda \ge 0, \end{aligned} \]that is not in min-max form and requires no other special techniques to solve. The optimal value of this problem need not be the same as that of the original, but is always guaranteed to be at least as large.

Of course, this *can* help, but it certainly doesn't solve our problem. We just need one more piece to the puzzle!

If you've studied a bit of convex analysis, the punchline is this: the inequality between \(f\) and \(\bar f\) above holds with equality when \(f\) is concave in \(y\) and \(g\) is convex in \(y\). More specifically,

\[ \inf_{\lambda \ge 0} \bar f(x, \lambda) = \sup_{g(y) \le 0} f(x, y), \]for any \(x\).^{[3]}

When this is true, the new problem has the same optimal value as the original and any solution \(x\) for the original is a solution to the new problem! (Why?)
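As a toy illustration of this equality (a made-up example, not from any paper): take \(f(x, y) = -(y-x)^2\), which is concave in \(y\), and \(g(y) = y^2 - 1\), which is convex. Here \(\bar f\) has the closed form \(\bar f(x, \lambda) = \lambda(1 - x^2/(1+\lambda))\), by maximizing the concave quadratic in \(y\) at \(y^\star = x/(1+\lambda)\), and a grid search over \(\lambda \ge 0\) matches the constrained supremum:

```python
def fbar(x, lam):
    # sup_y [-(y - x)^2 - lam*(y^2 - 1)], attained at y = x / (1 + lam).
    return lam * (1 - x * x / (1 + lam))

def primal(x):
    # sup over feasible y (y^2 <= 1) of -(y - x)^2, by a fine grid.
    return max(-(y - x) ** 2 for y in [i / 10000 - 1 for i in range(20001)])

for x in [0.0, 0.5, 2.0, -3.0]:
    dual = min(fbar(x, i / 10000) for i in range(100001))  # lam in [0, 10]
    print(x, primal(x), dual)
```

For \(|x| > 1\), both sides come out to \(-(|x|-1)^2\), attained at \(\lambda^\star = |x| - 1\); for \(|x| \le 1\) both are zero.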

I won't cover more since the specifics don't matter too much, but the general idea is simple enough, and we now have all the parts to convert the min-max problem (4) of Cetegen and Stuber to a simple convex optimization problem.

Things are mostly algebra from here on out, so apologies in advance, I guess. I will leave much of the "hard" work to the reader :)

The complete problem (4) is, as written in the paper, using (mostly) their notation:

\[ \begin{aligned} & \text{minimize} && \max_{M \in \mathcal{M}} x^TMx \\ & \text{subject to} && r^Tx \ge r_\mathrm{min}\\ &&& 1^Tx = 1\\ &&& 0 \le x \le 1, \end{aligned} \]with variable \(x \in \mathbb{R}^n\), where \(r\) is some vector of returns (but the specifics don't matter) and \(\mathcal{M}\) is:

\[ \mathcal{M} = \{M \ge 0 \mid M^-_{ij} \le M_{ij} \le M^+_{ij}, ~~ i, j=1, \dots, n\}. \]In other words, \(\mathcal{M}\) is the set of positive semidefinite matrices (\(M \ge 0\)) whose entries lie between those of \(M^-\) and \(M^+\). I've also dropped some constant terms in the objective since those don't change the problem.

In this case, the "inner" optimization problem is the one in the objective, which is just

\[ \begin{aligned} & \text{maximize} && x^TMx \\ & \text{subject to} && M^-_{ij} \le M_{ij} \le M^+_{ij}, ~~ i, j=1, \dots, n\\ &&& M \ge 0, \end{aligned} \]with variable \(M \in \mathbb{R}^{n\times n}\). We can easily write a (slightly not canonical) Lagrangian:

\[ L(x, M, \Lambda^+, \Lambda^-) = x^TMx - \mathrm{tr}(\Lambda^+(M - M^+)) + \mathrm{tr}(\Lambda^-(M - M^-)), \]where \(\Lambda^+, \Lambda^- \in \mathbb{R}^{n\times n}_+\) are elementwise nonnegative. (The Lagrangian is non-canonical because I have not included the constraint \(M \ge 0\), which we will enforce below.) It is not hard to show that

\[ \sup_{M \ge 0} L(x, M, \Lambda^+, \Lambda^-) = \begin{cases} \mathrm{tr}(\Lambda^+M^+) - \mathrm{tr}(\Lambda^-M^-) & xx^T \le \Lambda^+ - \Lambda^-\\ + \infty & \text{otherwise}. \end{cases} \]As before, the inequality between matrices is with respect to the semidefinite cone.

Plugging this back into the original problem formulation, we now have a convex optimization problem:

\[ \begin{aligned} & \text{minimize} && \mathrm{tr}(\Lambda^+M^+) - \mathrm{tr}(\Lambda^-M^-)\\ & \text{subject to} && r^Tx \ge r_\mathrm{min}\\ &&& xx^T \le \Lambda^+ - \Lambda^-\\ &&& 1^Tx = 1\\ &&& 0 \le x \le 1\\ &&& \Lambda^+_{ij}, \Lambda^-_{ij} \ge 0, \quad i,j =1, \dots, n. \end{aligned} \]The (extra!) variables \(\Lambda^+, \Lambda^- \in \mathbb{R}^{n\times n}\) are included along with the original variable \(x \in \mathbb{R}^n\), and the same problem data as before. This problem, by use of the Schur complement applied to the semidefinite inequality, is easily recast into standard SDP form and can be solved by most standard convex optimization problem solvers, such as SCS or Mosek.
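For concreteness, the Schur-complement step works because the lower-right block below is the positive scalar 1:

\[ xx^T \le \Lambda^+ - \Lambda^- \quad \Longleftrightarrow \quad \begin{bmatrix} \Lambda^+ - \Lambda^- & x \\ x^T & 1 \end{bmatrix} \ge 0, \]and the right-hand side is linear in \((x, \Lambda^+, \Lambda^-)\), so it can be passed directly to an SDP solver.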

Quick edit (3/18/22): Thanks to bodonoghue85 for finding a typo!

[1] | See, *e.g.*, page 429 of this paper, where they seem to re-prove basic things like "the supremum of a bunch of convex functions is convex," and "partial minimization of a convex function is convex," but I'm honestly not 100% sure. |

[2] | This is called weak duality, cf. sections 5.1.3 and 5.2.2 of Convex Optimization. |

[3] | This is often referred to as strong duality, cf. section 5.3.2 of Convex Optimization. There are some additional conditions for equality to hold, but these are almost always met in practice. |

There are many notes and posts talking about the fact that sorting will, in the worst case, always take \(\Omega(n \log n)\) comparisons; equivalently, in the worst case, the number of comparisons is about a constant factor away from \(n \log n\) when \(n\) is large. In many cases, the proof presented depends on the fact that the sorting algorithm is deterministic and goes a little like this (see here for a common case):

Let \(C_k\) be the \(k\)th step of a deterministic sorting algorithm (this can be represented, for example, as a tuple containing what the next comparison should be) with input \(L \in K^n\) where \(L\) is a list of \(n\) comparable elements. (For example, \(L\) can be a list of real numbers, in which case \(K = \mathbb{R}\).)

By the definition of a deterministic algorithm, \(C_k\) depends only on the past \(k-1\) comparisons; *i.e.*, \(C_k(C_{k-1}, C_{k-2}, \dots, C_1)\). (I am slightly overloading notation here, of course, but the meaning should be clear.) This means that we can view the behavior of the algorithm as a tree, where \(C_k\) is a child node of \(C_{k-1}\) which itself is a child node of \(C_{k-2}\), etc. Additionally, the tree is *binary* since the output of a comparison is only one of two possibilities (if \(C_k = (a, b)\), then either \(a\le b\) or \(b \le a\)).

Finally, let the leaf nodes of this tree be the list of indices (say, \(p\), where each entry is an index, \(p_i \in \{1, \dots, n\}\) for \(i=1, \dots, n\)) such that the list permuted at these indices is sorted, \(L_{p_1} \le L_{p_2} \le \dots \le L_{p_n}\). Note that the number of nodes needed to get from the root node to a given leaf node (or permutation) is exactly the number of comparisons that the algorithm makes before it returns a specific permutation. If we can show that the height (the length of the longest path from root to leaf) of the tree is always larger than about \(n \log n\), then we've shown that this algorithm must take at least \(n \log n\) steps.

The idea for the bound is pretty simple: since this algorithm is a sorting algorithm and it can receive *any* unsorted list, then each of the \(n!\) possible permutations must be a leaf node of the tree. (Why?) Additionally, the maximum number of leaves for a binary tree of height \(h\) is \(2^h\), which, in turn, implies that we must have \(2^h \ge n!\). Taking the log of both sides shows that:

\[ h \ge \log_2(n!) = \Omega(n \log n), \]which is exactly what we wanted to show. (The second equality is shown, for example, in the original reference for this proof; see page 2.)

To prove this statement we only really used a few things: (a) the algorithm has to decide between \(n!\) things, and (b) at each query, it only receives a "yes" or a "no" (as we will soon make rigorous, it only gains 1 bit of information from each query). The rest of the proof simply sets up scaffolding for the remaining parts, most of which is really somewhat orthogonal to our intuition. The point is: look, we have \(n!\) things we have to decide on and every time we ask a question, we cut down our list of possible true answers by about half. How many times do we need to cut down our list to be able to have exactly one possible answer? (Of course, as we showed before, this should be \(\log_2(n!)\).)
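To get a concrete feel for how \(\log_2(n!)\) compares with \(n\log_2 n\), here is a quick computation (standard library only):

```python
import math

for n in [8, 64, 1024, 65536]:
    # Exact number of bits needed to single out one of n! permutations,
    # up to float rounding, versus the n*log2(n) upper envelope.
    bits = math.log2(math.factorial(n))
    print(n, bits, n * math.log2(n))
```

The ratio of the two columns tends to 1 as \(n\) grows, consistent with \(\log_2(n!) = \Theta(n \log n)\).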

Now, a good number of writings (and some textbooks) I've seen simply end here by saying "the algorithm gains at most 1 bit for every query and we need at least \(\log(n!) \sim n \log n\) bits" and then give some vague citation to Shannon's theorem about communication without explicitly writing out any argument. (While there is some connection, it's unclear what it is or how to even make it rigorous in a general sense.) It's a bit of a shame since it doesn't really take much math to fully justify this intuitive statement, as we will see next.

The idea behind a (simple) information theoretic approach is to view the algorithm as attempting to 'uncover' the true, sorted permutation by querying an oracle that knows whether two elements are in sorted order or not. Here, the oracle gets to choose some 'true' permutation uniformly at random and then the sorting algorithm queries the oracle in a restricted sense: it can only ask yes or no questions about the list.

More formally, we will let \(\mathcal X\) be the set of all \(n!\) possible permutations of \(1\) through \(n\) and let \(X\) be the oracle's permutation such that \(X \sim \mathcal X\) is uniformly randomly sampled from \(\mathcal X\). Then, the algorithm gets to ask a sequence of yes/no queries \(Y_i \in \{0, 1\}\) for \(i=1, \dots, k\) which are dependent on the permutation \(X\), and, at the end of \(k\) queries, must respond with some 'guess' for the true permutation \(\hat X\), which is a random variable that depends only on \(Y\).

We can represent the current set up as a Markov chain \(X \to Y \to \hat X\), since \(\hat X\) is conditionally independent of \(X\) given \(Y\) (*i.e.*, the algorithm can only use information given by \(Y\)) while the random variable \(Y\) depends only on \(X\). The idea then is to give a bound on the number of queries \(k\) required for an algorithm to succeed with probability 1.

To give a lower bound on this quantity, we'll use a tool from information theory called Fano's inequality, which, surprisingly, I don't often see taught in information theory courses. (Perhaps I haven't been taking the right ones!)

I learned about this lovely inequality in John Duchi's class, EE377: *Information Theory and Statistics* (the lecture notes for the class are here). Its proof really makes it clear why entropy, as defined by Shannon, is pretty much the exact right quantity to look at. We'll explore a weaker version of it here that is simpler to prove and requires fewer definitions but which will suffice for our purposes.

We will use \(X, Y\) as random variables and set \(H\) as the entropy, defined:

\[ H(X) = -\sum_{x \in \mathcal{X}}P(x) \log P(x), \]where \(\mathcal{X}\) is the space of values that \(X\) can take on and \(P(x) = \mathbf{Pr}(X = x)\). The conditional entropy of \(X\) given \(Y\) is defined as

\[ H(X|Y) = -\sum_{y \in \mathcal Y} P(y) \sum_{x \in \mathcal X} P(x\mid y) \log P(x\mid y) = H(X, Y) - H(Y). \]As usual, the entropy \(H\) is a measure of the 'uncertainty' in the variable \(X\), with the maximally uncertain distribution being the uniform one.^{[1]} Additionally, note that \(H(X, Y)\) is the entropy taken with respect to the joint distribution of \(X\) and \(Y\). Finally, if \(X\) is a deterministic function of \(Y\), then \(H(X\mid Y) = 0\), which follows from the definition.
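A quick sanity check of these definitions, with a small made-up joint distribution:

```python
import math

# A small, made-up joint distribution P(x, y) over {0, 1} x {0, 1}.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(dist):
    """Entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

Py = {y: P[(0, y)] + P[(1, y)] for y in (0, 1)}  # marginal of Y
Px = {x: P[(x, 0)] + P[(x, 1)] for x in (0, 1)}  # marginal of X

# H(X|Y) computed from the definition...
H_x_given_y = -sum(p * math.log2(p / Py[y]) for (x, y), p in P.items())

# ...agrees with the chain-rule form H(X, Y) - H(Y).
print(H_x_given_y, H(P) - H(Py))
```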

For this post, we will only make use of the following three properties of the entropy (I will not prove them here, as there are many available proofs of them, including the notes above and Cover's famous *Elements of Information Theory*):

The entropy can only decrease when removing variables, \(H(X, Y) \ge H(X)\).

The entropy is smaller than the log of the size of the sample space \(H(X \mid Y) \le H(X) \le \log|\mathcal X|\). (Equivalently, conditioning reduces entropy, and the uniform distribution on \(\mathcal X\) has the highest possible entropy, \(\log|\mathcal X|\).)

If a random variable \(\hat X\) is conditionally independent of \(X\) given \(Y\) (*i.e.*, if \(X \to Y \to \hat X\) is a Markov chain), then \(H(X\mid \hat X) \ge H(X\mid Y)\). This is often called a data processing inequality, which simply says that \(X\) has smaller entropy knowing \(Y\) than knowing a variable, \(\hat X\), that has undergone further processing. In other words, you cannot gain more information about \(X\) from a variable \(Y\) by further processing \(Y\).

This is all we need to prove the following inequality. Let \(X \to Y \to \hat X\) be a Markov chain such that \(\hat X\) is conditionally independent of \(X\) given \(Y\) and \(X\) is uniformly drawn from \(\mathcal X\). Then the probability of error \(P_e = \mathbf{Pr}(X \ne \hat X)\) satisfies

\[ P_e \ge 1 - \frac{k+1}{\log |\mathcal X|}, \]where \(k\) is the number of binary queries made and \(|\mathcal X|\) is the number of elements in \(\mathcal X\).

Proving this inequality is enough to show the claim. Note that we want the probability of error to be 0 (since we want our algorithm to work!), so

\[ 1 - \frac{k+1}{\log |\mathcal X|} \le 0 \quad \text{implies} \quad k+1 \ge \log |\mathcal X|, \]and, since \(\mathcal X\) is the space of possible permutations of size \(n\) (of which there are \(n!\)), then \(|\mathcal X| = n!\) and \(\log n! = \Omega(n \log n)\), so \(k\) must satisfy

\[ k = \Omega(n \log n). \]In other words, the number of queries (or comparisons, or whatever) \(k\) must be approximately at least as large as \(n \log n\), asymptotically (up to constant multiples). In fact, this proves a slightly *stronger* statement that no probabilistic algorithm can succeed with nonvanishing probability if the number of queries is not on the order of \(n \log n\), which our original proof above does not cover!

Now, the last missing piece is showing that the probability of error is bounded from below.

This is a slightly simplified version of the proof presented in Duchi's notes (see section 2.3.2) for the specific case we care about, which requires fewer additional definitions and supporting statements. Let \(E\) be the 'error' random variable that is \(1\) if \(X \ne \hat X\) and 0 otherwise, and let's look at the quantity \(H(E, X\mid \hat X)\):

\[ H(E, X\mid \hat X) = H(X\mid \hat X, E) + H(E\mid \hat X), \]by our definition of conditional entropy, and

\[ H(X\mid \hat X, E) = P_e H(X\mid \hat X, E=1) + (1-P_e)H(X\mid \hat X, E=0), \]again by the same definition. Since \(X = \hat X\) whenever there is no error, \(E=0\), then \(H(X\mid \hat X, E=0) = 0\) since \(X\) is known, so, we really have

\[ H(E, X\mid \hat X) = P_e H(X\mid \hat X, E=1) + H(E\mid \hat X). \]Since \(E\) can only take on two values, we have that \(H(E\mid \hat X) \le \log(2) = 1\) and we also have that \(H(X\mid \hat X, E=1) \le \log |\mathcal X|\), which gives

\[ H(E, X\mid \hat X) \le P_e \log|\mathcal X| + 1. \]Now, we have that

\[ H(E, X\mid \hat X) \ge H(X\mid \hat X) \ge H(X\mid Y). \]The first inequality follows from the fact that we're removing a variable and the second follows from statement 3 in the previous section (as \(X \to Y \to \hat X\)). Using the definition of \(H(X\mid Y)\), then we have

\[ H(X, Y) - H(Y) \ge H(X) - H(Y) = \log |\mathcal X| - H(Y) \ge \log |\mathcal X| - \log|\mathcal Y|. \]The first inequality here follows since we're (again!) removing a variable and the equality follows from the fact that \(X\) is uniformly randomly drawn from \(\mathcal X\) and the last inequality follows from the fact that the entropy of \(Y\) is always smaller than that of the uniform distribution on \(\mathcal Y\). Finally, note that, if we have \(k\) queries, then \(|\mathcal Y| = 2^k\) (this is the number of possible values a sequence of \(k\) binary queries can take on). So, \(\log |\mathcal Y| = k\) (in other words, the maximum amount of information we can get with \(k\) binary queries is \(k\) bits) so we find

\[ \log |\mathcal X| - k \le P_e\log |\mathcal X| + 1, \]or, after some slight rewriting:

\[ P_e \ge 1 - \frac{k+1}{\log |\mathcal X|}, \]as required!
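As a tiny worked example of the inequality (a made-up setup, not part of the proof): let \(X\) be uniform on \(\mathcal X = \{0, \dots, 7\}\), so \(\log|\mathcal X| = 3\) bits, and suppose the \(k\) queries reveal the top \(k\) bits of \(X\). The best guess then succeeds with probability \(2^{k-3}\), and the resulting error probability always sits above the Fano bound:

```python
import math

X_size = 8                    # |X| = 8, so log2|X| = 3 bits
log_X = math.log2(X_size)

for k in range(0, 4):
    # Observing the top k bits leaves 2^(3-k) equally likely candidates,
    # so the best guess errs with probability 1 - 2^(k-3).
    p_err = 1 - 2 ** (k - 3)
    fano = 1 - (k + 1) / log_X
    print(k, p_err, fano)
    assert p_err >= fano
```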

I think overall that Fano's inequality is a relatively straightforward way of justifying a ton of the statements one can make about information theory without needing to invoke a large number of more complicated notions. Additionally, the proof is relatively straightforward (in the sense that it only requires very few definitions and properties of the entropy) while also matching our intuition about these problems pretty much exactly.

In particular, we see that sorting is hard not because somehow ordering elements is difficult, but because we have to decide between a bunch of different items (in fact, \(n!\) of them) while only receiving a few bits of information at any point in time! In fact, this bound applies to *any* yes/no query that one can make of the sorted data, not just comparisons, which is interesting.

There are some even more powerful generalizations of Fano's inequality which can also be extended to machine learning applications: you can use them to show that, given only some small amount of data, you cannot decide between a parameter that correctly describes the data and one that does not.

This is all to say that, even though entropy is a magical quantity, that doesn't mean we can't say very rigorous things about it (and make our intuitions about lower bounds even more rigorous, to boot!).

[1] | In fact, the entropy is really a measure of how close a variable is to the uniform distribution, in the case of finite domains \(\mathcal X\)—the higher the entropy, the closer it is. |

While this question has been explored somewhat extensively, the exposition is often more general than necessary and aimed at a relatively mathematical audience. Either way, if you're interested, both papers are fairly well-written—I highly recommend at least a quick skim!

The S-procedure is a well known lemma in control theory that seeks to answer the following question:

Let's say we have a bunch of quadratic functions \(f_0, f_1, f_2, \dots, f_n : \mathbb{R}^m \to \mathbb{R}\). When is it true that

\[ f_i(x) \le 0 ~~ \text{for $i=1, \dots, n$} \implies f_0(x) \le 0, \]for \(x \in \mathbb{R}^m\)? (Recall that a quadratic is a function of the form \(f(x) = x^TPx + 2q^Tx + r\) for symmetric \(P \in \mathbb{R}^{m\times m}\), \(q \in \mathbb{R}^m\), and \(r \in \mathbb{R}\)).

There are many reasons to attempt to answer this (surprisingly useful) question. The original motivations were to show stability of systems, though the domain of applications is certainly larger. We can use this to show anything from impossibility results (for example, many of the results of this paper can be recast in terms of the S-procedure) to, well, in our case, the construction of a small covering ellipsoid from a bunch of other ellipsoids, which is itself useful for things like filtering (for localizing drones from noisy measurements for example) along with many other applications.

If you're familiar with Lagrange duality, this is mostly an equivalent statement—except that this statement is in the special case of quadratics, where you can say a little more than with general functions.

We can fully and completely answer this question in the case that \(n=1\): assuming there is some point \(\bar x\) with \(f_1(\bar x) < 0\) (a Slater-type condition), there exists a nonnegative number \(\tau \ge 0\) such that

\[ f_0(x) \le \tau f_1(x) \]for all \(x\) if, and only if, \(f_1(x) \le 0 \implies f_0(x) \le 0\).

Why? Well, let's say we have a \(\tau\ge 0\) that satisfies the above inequality. Then, for any \(x\) with \(f_1(x) \le 0\),

\[ f_0(x) \le \tau f_1(x) \le \tau \cdot 0 = 0. \]The converse is slightly trickier, so I will defer to B&V's *Convex Optimization*, which has a very readable presentation of the proof (see B.1 and B.2 in the appendix).

The general case is really only a slight change from the \(n=1\) case (except that the converse of the statement is not true). In particular, if there exist \(\lambda_1, \dots, \lambda_n \ge 0\) such that

\[ f_0(x) \le \sum_i \lambda_i f_i(x) ~~ \text{for all $x \in \mathbb{R}^m$}, \]then, \(f_i(x) \le 0 ~ \text{for} ~ i = 1, \dots, n \implies f_0(x) \le 0\). Showing this is nearly the same as the \(n=1\) case,

\[ f_0(x) \le \sum_i \lambda_i f_i(x) \le \sum_i \lambda_i \cdot 0 = 0. \]So now we have a family of sufficient (but not necessary!) conditions for which we know when \(f_i(x) \le 0 ~ \text{for} ~ i = 1, \dots, n\) implies that \(f_0(x) \le 0\).
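As a tiny made-up example of such a certificate: take \(f_1(x) = x^2 - 1\), \(f_2(x) = -x\), and \(f_0(x) = x^2 - x - 1\), with \(\lambda = (1, 1)\); then \(\lambda_1 f_1(x) + \lambda_2 f_2(x) = x^2 - x - 1 = f_0(x)\) identically, so the implication holds on the feasible set \(0 \le x \le 1\). A grid check in pure Python:

```python
f1 = lambda x: x * x - 1      # constraint: x in [-1, 1]
f2 = lambda x: -x             # constraint: x >= 0
f0 = lambda x: x * x - x - 1
lam = (1.0, 1.0)              # candidate certificate

xs = [i / 1000 - 5 for i in range(10001)]  # grid on [-5, 5]

# Certificate inequality f0 <= lam1*f1 + lam2*f2 holds everywhere...
assert all(f0(x) <= lam[0] * f1(x) + lam[1] * f2(x) + 1e-9 for x in xs)

# ...so the implication holds on the feasible set:
assert all(f0(x) <= 1e-9 for x in xs if f1(x) <= 0 and f2(x) <= 0)
print("certificate verified on the grid")
```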

Ellipsoids are a particularly nice family to work with since, as you may have guessed, they are the sets defined by

\[ \mathcal{E} = \{x \mid f(x) \le 0\}, \]where \(f: \mathbb{R}^m \to \mathbb{R}\) is a convex quadratic. This definition gives us a way of translating statements about sets (inclusion, etc.) into statements about the functions which generate them. In particular, if we have two ellipsoids \(\mathcal{E}, \mathcal{E}_0 \subseteq \mathbb{R}^m\) defined by the convex quadratics \(f, f_0\), then

\[ \mathcal{E} \subseteq \mathcal{E}_0 \iff (f(x) \le 0 \implies f_0(x) \le 0). \]But wait a minute, we know exactly when this happens! By the previous section, we found that

\[ f(x) \le 0 \implies f_0(x) \le 0, \]if and only if there is some \(\tau \ge 0\) with \(f_0(x) \le \tau f(x)\). Also note that if we have a union of a bunch of ellipsoids (say \(\mathcal{E}_1, \dots, \mathcal{E}_m\)) that we want to cover with an ellipsoid \(\mathcal{E}_0\), then this is the same as saying

\[ \mathcal{E}_i \subseteq \mathcal{E}_0, ~\text{for $i=1, \dots, m$}, \]or, that each ellipsoid is covered by the big one, \(\mathcal{E}_0\).

Ok, to reiterate, we are looking for a small ellipsoid \(\mathcal{E}_0 = \{x \mid f_0(x) \le 0\}\) such that \(\mathcal{E}_0\) contains all of the other ellipsoids \(\mathcal{E}_i = \{x \mid f_i(x) \le 0\}\), where the \(f_i\) and \(f_0\) are convex quadratics. In other words, using the results of the previous subsection, we look for a quadratic \(f_0\) such that

\[ (f_i(x) \le 0 \implies f_0(x) \le 0) ~~ \text{for each $i$} \]which we know happens exactly when there exists some \(\tau_i \ge 0\) with

\[ f_0(x) \le \tau_i f_i(x) ~~ \text{for each $i$ and all $x$}. \]Now remains the final question: given two quadratics, \(f_i\) and \(f_0\) and some number \(\tau \ge 0\), how can we check if \(f_0(x) \le \tau f_i(x)\) for all \(x\)? I won't prove this (though I have written a quick proof of this statement in my notes, found here), but, if we let \(f_i(x) = x^TP_ix + 2q_i^Tx + r_i\) and \(f_0(x) = x^TP'x + 2(q')^Tx + r'\) then \(f_0(x) \le \tau f_i(x)\) for all \(x\) if, *and only if*,

\[ \begin{bmatrix} P' & q'\\ (q')^T & r' \end{bmatrix} \le \tau \begin{bmatrix} P_i & q_i\\ q_i^T & r_i \end{bmatrix}, \]

where we say two symmetric matrices \(A, B\) satisfy \(A \le B\) whenever \(x^TAx \le x^TBx\) for all \(x\). A straightforward exercise is to verify that the set of matrices \(A \ge 0\) is a convex cone (almost universally called the positive semidefinite or PSD cone).
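A quick numerical way to check the ordering \(A \le B\) is to look at the eigenvalues of \(B - A\). Here is a small sketch (the matrices are made up for illustration):

```python
import numpy as np

def psd_leq(A, B, tol=1e-9):
    """True if x^T A x <= x^T B x for all x, i.e. B - A is PSD."""
    return np.linalg.eigvalsh(B - A).min() >= -tol

A = np.array([[1.0, 0.0], [0.0, 1.0]])
B = np.array([[2.0, 0.5], [0.5, 2.0]])

assert psd_leq(A, B)       # B - A has eigenvalues 0.5 and 1.5, both nonnegative
assert not psd_leq(B, A)   # A - B has negative eigenvalues

# The PSD matrices form a convex cone: nonnegative combinations stay PSD.
C = 2.0 * A + 3.0 * (B - A)
assert np.linalg.eigvalsh(C).min() >= 0
```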

This rewriting is extremely useful, since we've turned a problem over a potentially difficult-to-handle space (the space of quadratics greater than or equal to another) into a problem that is easy to handle (the PSD cone). The best news, though, is that we have efficient algorithms to solve optimization problems whose constraints are PSD constraints.^{[1]}

Finally, after enough background, we can get to the final goal: writing an efficiently-solvable optimization problem to give us a small bounding ellipsoid.

There are several ways of defining "small," in the case of ellipsoids, but one of the most common definitions is to pick the ellipsoid with the smallest volume. In the case that \(\mathcal{E}_0 = \{x \mid x^TP'x + 2(q')^Tx + r' \le 0\}\), the *volume* of this ellipsoid is controlled by the determinant \(\mathop{\mathrm{det}} (P')^{-1}\) of the inverse of the matrix \(P'\): the smaller this determinant, the smaller the ellipsoid. So, we can write—using the conditions given above—an optimization problem corresponding to finding the smallest (in volume) ellipsoid \(\mathcal{E}_0\) which contains all ellipsoids, \(\mathcal{E}_i\) as

\[ \begin{aligned} & \underset{P', q', r', \tau}{\mathrm{minimize}} & & \mathop{\mathrm{det}} (P')^{-1} \\ & \text{subject to} & & \begin{bmatrix} P' & q'\\ (q')^T & r' \end{bmatrix} \le \tau_i \begin{bmatrix} P_i & q_i\\ q_i^T & r_i \end{bmatrix}, \quad i = 1, \dots, m \\ &&& \tau_i \ge 0, \quad i=1, \dots, m. \end{aligned} \]

The only problem here (which we can easily fix) is that \(\mathop{\mathrm{det}} (P')^{-1}\) is awkward to minimize directly. On the other hand, its *log* is convex (since \(\log\mathop{\mathrm{det}} P'\) is concave; for a proof, see the Convex book, section 3.1.5), so we can write,

\[ \begin{aligned} & \underset{P', q', r', \tau}{\mathrm{minimize}} & & \log \mathop{\mathrm{det}} (P')^{-1} \\ & \text{subject to} & & \begin{bmatrix} P' & q'\\ (q')^T & r' \end{bmatrix} \le \tau_i \begin{bmatrix} P_i & q_i\\ q_i^T & r_i \end{bmatrix}, \quad i = 1, \dots, m \\ &&& \tau_i \ge 0, \quad i=1, \dots, m. \end{aligned} \]

This is equivalent to the original problem since \(\log(y)\) is an increasing function of \(y\).

Of course, any convex function (such as, for example the trace) would do here as well.

Ok, now we know how to solve the problem where we have a bunch of ellipsoids and we want to find an ellipsoid which covers all of them. How about the problem where we want to find an ellipsoid which also covers the sets \(\{N_i\}\) for \(i=1, \dots, k\), which are, themselves, intersections of ellipsoids?

In particular, if \(N_i\) is defined as

\[ N_i = \bigcap_{j \in I_i} \mathcal{E}_j, \]for some index set \(I_i \subseteq \{1, \dots, n\}\) and some set of ellipsoids \(\{\mathcal{E}_j\}\), each of which is defined as before (\(\mathcal{E}_j = \{x \mid f_j(x) \le 0 \}\)), we can perform a similar trick to the one above!

More generally, if we have an ellipsoid \(\mathcal{E}_0 = \{x \mid f_0(x) \le 0\}\), then

\[ N_i \subseteq \mathcal{E}_0 \iff (f_j(x) \le 0 ~~ \text{for $j \in I_i$} \implies f_0(x) \le 0). \](It's a worthwhile exercise to think about why, but it follows the same idea as before.) In other words, \(\mathcal{E}_0\) is a superset of \(N_i\) only when *a bunch of quadratic inequalities imply another*. (Where have we seen this before...?)

In other words, we know (by the first section) that, if there exist \(\lambda_j \ge 0\) such that

\[ f_0(x) \le \sum_{j \in I_i} \lambda_{j} f_j(x), \]then we immediately have that \(N_i \subseteq \mathcal{E}_0\). Since the converse is not true, we are sadly not guaranteed to actually find the smallest bounding ellipsoid \(\mathcal{E}_0\), but this is usually quite a good approximation (if it's not exact).

Following exactly the same steps as in the previous section and using the same definitions, we now get a new program for finding a small ellipsoid covering the union of the intersections of ellipsoids:

\[ \begin{aligned} & \underset{P', q', r', \lambda}{\mathrm{minimize}} & & \log \mathop{\mathrm{det}} (P')^{-1} \\ & \text{subject to} & & \begin{bmatrix} P' & q'\\ (q')^T & r' \end{bmatrix} \le \sum_{j \in I_i}\lambda_{ij}\begin{bmatrix} P_j & q_j\\ q_j^T & r_j \end{bmatrix}, \quad i = 1, \dots, k \\ &&& \lambda_{ij} \ge 0, \quad i=1, \dots, k, ~~ j \in I_i. \end{aligned} \]As before, this program does not guarantee actually finding the minimal volume ellipsoid, but it is likely to be quite close! (That is, if it's not spot on, most of the time.)

[1] | For more information, see Boyd's Linear Matrix Inequalities book. |

The main result of the above paper is kind of weird: essentially, it turns out that you can say what devices are physically *impossible* by phrasing certain problems as optimization problems and then using some basic tools of optimization to derive lower bounds.

To illustrate: imagine you want to design an engine which is as efficient as possible; we know the best you could possibly hope to do is given by the second law of thermodynamics. Now, what if (and bear with me here) we want something a little weirder? Say, what if we want a heat sink that has a particular dissipation pattern? Or what if you want a photonic crystal that traps light of a given wavelength in some region? Or a horn which has specific resonances?

We can write down the optimization problems corresponding to each of these circumstances: in general, these problems are very hard to solve in ways that aren't just "try all possible designs and pick the best one." (And there are a *lot* of possible designs.) On the other hand, by using some simple heuristics—gradient descent, for example—we appear to get much better devices than almost anyone can design by hand. This approach, while it appears to work well in practice, brings up a few questions with no obvious answers.

1. Maybe there is some design that is really complicated, one that these heuristics almost always miss, but that is much better than the current ones.

2. It is possible that the objective we are requesting is physically impossible to achieve, in which case we will never find a good design.

3. Many heuristics depend heavily on the initial design we provide. Physical intuition sometimes appears to provide good initializations, but often the final design is unintuitive, so perhaps there are better approaches.

The paper provides (some) answers to these questions. In particular, it answers point (2) as its main goal, which gives a partial answer to (1) (namely that the heuristics we use appear to give designs that are often close to the best possible design, at least for the problems we tested), and an answer to (3), since the impossibility result suggests an initial design as a byproduct of computing the certificate of impossibility.

I'll explain the interesting parts of this paper in more detail below, since the paper (for the sake of brevity) simply references the reader to derivations of the results (and leaves some as exercises).

In optimization theory, there is a beautiful idea called *Lagrange duality*, which gives lower bounds to any optimization problem you can write down (at least theoretically speaking).

Let's say we have the following optimization problem,

\[ \begin{array}{ll} \text{minimize} & f(x)\\ \text{subject to} & h(x) \le 0, \end{array} \](this encompasses essentially every optimization problem ever) with objective function \(f: \mathbb{R}^n \to \mathbb{R}\) and constraint function \(h: \mathbb{R}^n \to \mathbb{R}^m\), where the inequality is taken elementwise. Call the optimal value of the objective of the optimization problem \(p^\star\), which we will see again soon.

Continuing, we can then formulate the *Lagrangian* of the problem,

\[ \mathcal{L}(x, \lambda) = f(x) + \lambda^Th(x), \]with \(\lambda \ge 0\). Finally, we formulate the dual *function*

\[ g(\lambda) = \inf_x \mathcal{L}(x, \lambda). \]

Now, and here's the magic, this dual function \(g(\lambda)\) at any \(\lambda \ge 0\) is always a lower bound for the optimal objective \(p^\star\). Why? Well,

\[ g(\lambda) = \inf_x \mathcal{L}(x, \lambda), \]and, by definition of \(\inf\),

\[ \inf_x \mathcal{L}(x, \lambda) \le \mathcal{L}(x, \lambda), \]for every \(x\). Now, every feasible point \(x^\mathrm{feas}\) of the optimization problem satisfies \(h(x^\mathrm{feas}) \le 0\) (this is the definition of 'feasible'), so, since \(\lambda \ge 0\),

\[ \mathcal{L}(x^\mathrm{feas}, \lambda) = f(x^\mathrm{feas}) + \underbrace{\lambda^Th(x^\mathrm{feas})}_{\le 0} \le f(x^\mathrm{feas}). \]In other words, for any feasible point, \(\mathcal{L}(x^\mathrm{feas}, \lambda)\) is always smaller than the objective value at that point. But, since \(g(\lambda)\) is smaller than \(\mathcal{L}(x, \lambda)\) for *any* \(x\), not just the feasible ones, we have

\[ g(\lambda) \le \mathcal{L}(x^\mathrm{feas}, \lambda) \le f(x^\mathrm{feas}), \]for any feasible point. This means that \(g(\lambda)\) is also at most as large as the optimal value (since every optimal point of this optimization problem is feasible). That is,

\[ g(\lambda) \le p^\star. \]Therefore, for any \(\lambda\ge 0\), we know that \(g(\lambda)\) is always a lower bound to the optimal objective value!
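To see the machinery on the smallest possible example (my own, with \(f(x) = x^2\) and \(h(x) = 1 - x\), so \(p^\star = 1\) at \(x = 1\)):

```python
import numpy as np

# minimize x^2 subject to 1 - x <= 0; the optimum is p* = 1 at x = 1.
p_star = 1.0

def g(lam):
    # L(x, lam) = x^2 + lam*(1 - x) is minimized at x = lam/2,
    # which gives g(lam) = lam - lam^2/4.
    return lam - lam**2 / 4.0

lams = np.linspace(0.0, 10.0, 1001)
assert all(g(l) <= p_star + 1e-12 for l in lams)      # g(lam) <= p* for every lam >= 0
assert abs(max(g(l) for l in lams) - p_star) < 1e-9   # and here the best bound is tight (lam = 2)
```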

Of course, sometimes computing \(g(\lambda)\) is at least as difficult as solving the original problem (due to the \(\inf\) we have in the definition of \(g\)). It just so happens that many physical equations and objectives we care about are of a form elegant enough to give an explicit formula for \(g\), which is the main point of this paper.

Of course, we often want the best (largest) lower bound, not just *a* lower bound (which can often be quite bad). In other words, we want to maximize our lower bound. We can phrase this as the new optimization problem,

\[ \begin{array}{ll} \text{maximize} & g(\lambda)\\ \text{subject to} & \lambda \ge 0. \end{array} \]

What is interesting is that this optimization problem is always convex—*i.e.*, it is almost always easy to compute the optimal value, *if* we can explicitly write down what \(g\) is. (I won't prove this here, but the proof is very straightforward. Take a peek at section 5.1.2 in Boyd's *Convex Optimization*.)

Many of the problems we're interested in (including design in photonics via Maxwell's equations,^{[1]} acoustics via Helmholtz's equations, quantum mechanics via Schrödinger's equation, and heat engineering via the heat equation) have physics equations of the form (once discretized)

\[ (A + \mathrm{diag}(\theta))z = b, \]

where \(\theta \in \mathbb{R}^n\) are the design parameters (*e.g.* permittivity in the case of photonics, or speed of sound in the material in the case of acoustics) and \(z \in \mathbb{R}^n\) is the field (*e.g.* the electric field in photonics, or the amplitude of the wave in acoustics). \(A \in \mathbb{R}^{n\times n}\) is a matrix encoding the physics (the curl of the curl in Maxwell's equations, or a discretized Laplacian in Helmholtz's) and \(b \in \mathbb{R}^n\) is an excitation of the field.

More specifically, take a peek at Helmholtz's equation:

\[ \nabla^2 a(x) + \left(\frac{\omega^2}{c(x)^2}\right)a(x) = u(x), \]where \(c: \mathbb{R}^3 \to \mathbb{R}_{> 0}\) is a function specifying the speed of sound at every point in the material, while \(a: \mathbb{R}^3 \to \mathbb{R}\) is a function specifying the amplitude at each point, \(u: \mathbb{R}^3 \to \mathbb{R}\) is a function specifying an excitation, and \(\omega \in \mathbb{R}_{\ge 0}\) is the frequency of the wave. We can make some simple correspondences:

\[ \Bigg(\underbrace{\nabla^2}_{A} + \underbrace{\bigg(\frac{\omega^2}{c(x)^2}\bigg)}_{\mathrm{diag}(\theta)}\Bigg)\underbrace{a(x)}_{z} = \underbrace{u(x)}_{b}. \]Now, we usually want the field (\(z\)) to look similar to a desired field (which we will call \(\hat z\)), while satisfying the physics equation described above. We can phrase this in several ways, but a particularly natural one is by attempting to minimize the objective \(\left\|z - \hat z\right\|_2^2\).
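The correspondence can be sketched numerically. Here is a tiny 1-D version (sizes, the design \(\theta\), and the excitation are all made up for illustration): a finite-difference Laplacian plays the role of \(A\), and solving the physics equation recovers the field \(z\).

```python
import numpy as np

n = 50
h = 1.0 / (n + 1)
main = -2.0 * np.ones(n)
off = np.ones(n - 1)
A = (np.diag(main) + np.diag(off, 1) + np.diag(off, -1)) / h**2  # discretized Laplacian

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 1.0, size=n)  # design parameters in [theta_min, theta_max] = [0, 1]
b = np.zeros(n)
b[n // 2] = 1.0                        # a point excitation u

z = np.linalg.solve(A + np.diag(theta), b)           # the field
residual = np.linalg.norm((A + np.diag(theta)) @ z - b)
```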

Finally, we are only able to choose materials within a specific range: that is, \(\theta^\mathrm{min} \le \theta \le \theta^\mathrm{max}\).

Putting all of this together, we can write the optimization problem as

\[ \begin{array}{ll} \text{minimize} & \frac12\left\|z - \hat z\right\|_2^2\\ \text{subject to} & (A + \mathrm{diag}(\theta))z = b\\ & \theta^\mathrm{min} \le \theta \le \theta^\mathrm{max}. \end{array} \]which is exactly problem (1) in the paper, in the special case where \(W = I\), the identity matrix.

Here is essentially the only 'magic' part of the paper. First, we can write the Lagrangian of the problem as,

\[ \mathcal{L}(z, \theta, \nu) = \frac12\left\|z - \hat z\right\|_2^2 + \nu^T((A + \mathrm{diag}(\theta))z - b). \]Now, there is something weird here: notice that I sneakily dropped the term containing the lower and upper limits for \(\theta\)—this idea is, in fact, what saves the entire approach. What we will first do is the usual thing: we'll minimize the Lagrangian over all possible \(z\), which we can easily do since the Lagrangian is a convex quadratic over \(z\). In particular, taking the gradient over \(z\) and setting it to zero (which is necessary and sufficient by convexity and differentiability) gives us that the optimal \(z\) is

\[ z = \hat z - (A + \mathrm{diag(\theta)})^T\nu, \]which means that

\[ \inf_z \mathcal{L}(z, \theta, \nu) = - \frac12\left\|\hat z - (A + \mathrm{diag(\theta)})^T\nu\right\|_2^2 - \nu^Tb + \frac12\|\hat z\|_2^2. \]The next step is then finding the infimum of \(\mathcal{L}\) over \(\theta\). That is, finding

\[ \inf_\theta \left(\inf_z \mathcal{L}(z, \theta, \nu)\right). \]Now, of course, minimization over all \(\theta\) is a lower bound (but not a very good one), since, unless \(\nu = 0\), we can send the whole thing to negative infinity. (Why?)

What we can do instead is minimize over \(\theta\), constrained to its feasible range, \(\theta^\mathrm{min} \le \theta \le \theta^\mathrm{max}\). I'll leave it as an exercise for the reader as to why this is still a lower bound, but you should ponder this very carefully, because it is the main point of the paper. As an initial hint, take a second look at the proof above for why the Lagrangian is a lower bound in the first place. Of course, a second hint can be found in the paper which gives a somewhat-natural construction (sometimes called the "partial Lagrangian"), but I highly recommend sitting down with a bit of wine (or something stronger) and thinking about it!^{[2]}

If you've convinced yourself of this (or haven't yet, but want to continue), we now have the following minimization problem:

\[ \begin{aligned} g(\nu) &= \inf_{\theta^\mathrm{min} \le \theta \le \theta^\mathrm{max}} \left(\inf_z \mathcal{L}(z, \theta, \nu)\right)\\ &= \inf_{\theta^\mathrm{min} \le \theta \le \theta^\mathrm{max}} \left(- \frac12\left\|\hat z - (A + \mathrm{diag(\theta)})^T\nu\right\|_2^2 - \nu^Tb + \frac12\|\hat z\|_2^2\right) \end{aligned} \]The trick is to notice two things. One, that the objective is concave in \(\theta\) and, two, that the objective is *separable* over each component of \(\theta\).

First off, if a function \(v: \mathbb{R} \to \mathbb{R}\) is concave over the interval \([L, U]\), then it achieves its minimum value at one of the endpoints of the interval. Why? Well, the definition of concavity says, for every \(0 \le \gamma \le 1\),

\[ v(\gamma L + (1- \gamma)U) \ge \gamma v(L) + (1-\gamma)v(U) \ge \min\{v(L), v(U)\}, \]but any point in the interval \([L, U]\) is a convex combination of \(L\) and \(U\)! So every point inside of the interval is at least as large as the smallest endpoint of the interval, which completes the proof.

This solves our problem: since the objective is separable, then we only need to consider each component of \(\theta\), and, because it's concave, then we know that an optimal \(\theta_i\) is one of either \(\theta^\mathrm{min}_i\) or \(\theta^\mathrm{max}_i\). Replacing the complicated \(\inf\) with this (much simpler) \(\min\) gives the analytic solution for \(g\):

\[ \begin{aligned} g(\nu) = \sum_i \min\bigg\{-\frac12 (\hat z_i - a_i^T\nu + \theta_i^\mathrm{min} \nu_i)^2, &- \frac12 (\hat z_i - a_i^T\nu + \theta_i^\mathrm{max} \nu_i)^2\bigg\} \\&- \nu^Tb + \frac12\|\hat z\|_2^2, \end{aligned} \]or, writing it in the same way as the paper, by pulling out the \(-1/2\) (and using \(\theta^\mathrm{min} = 0\)),

\[ g(\nu) = -\frac12 \sum_i \max\bigg\{ (\hat z_i - a_i^T\nu + \nu_i)^2, (\hat z_i - a_i^T\nu + \theta_i^\mathrm{max} \nu_i)^2\bigg\} - \nu^Tb + \frac12\|\hat z\|_2^2. \]Now that we have an analytic form for \(g\) (our set of lower bounds), we can maximize the function to get the best lower bound. As discussed before, this is a convex optimization problem which can be formulated by CVXPY and solved using one of the many available solvers for convex quadratically-constrained quadratic programs (QCQPs) or second-order conic programs (SOCPs).
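The two claims (separability and concavity in \(\theta\)) can be checked numerically. In the sketch below (random, made-up data; signs follow the displayed expression for \(\inf_z \mathcal{L}\) above), the coordinate-wise endpoint minimization agrees with brute force over all \(2^n\) corners, and no interior point of the box does better.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))
b = rng.normal(size=n)
z_hat = rng.normal(size=n)
nu = rng.normal(size=n)
t_lo, t_hi = np.zeros(n), np.ones(n)

const = -nu @ b + 0.5 * z_hat @ z_hat

def inner(theta):
    # inf_z L(z, theta, nu), as displayed above
    w = z_hat - (A + np.diag(theta)).T @ nu
    return -0.5 * w @ w + const

# Coordinate-wise minimization over the two endpoints only...
At_nu = A.T @ nu
g_sep = sum(min(-0.5 * (z_hat[i] - At_nu[i] - t * nu[i]) ** 2 for t in (t_lo[i], t_hi[i]))
            for i in range(n)) + const

# ...matches brute force over all 2^n corners of the box,
g_corners = min(inner(np.array(c)) for c in product(*zip(t_lo, t_hi)))
assert abs(g_sep - g_corners) < 1e-9

# ...and no point of a grid over the whole box does better (concavity).
grid_vals = [inner(np.array(th)) for th in product(np.linspace(0, 1, 5), repeat=n)]
assert min(grid_vals) >= g_corners - 1e-9
```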

I'll give a quick summary of the results of the paper, but this is the section I would recommend checking out in the paper itself. (There are pretty pictures!)

For a relatively complex design, we found that a simple, commonly used heuristic finds a design with an objective value lying around 9% above the lower bound and, therefore, at most 9% above that of the *best possible* design. (In general, though, we suspect that the true optimum lies closer to the designs the heuristics give than to the lower bound we come up with.) In other words, it is physically impossible to more-than-marginally improve upon this design.

Additionally (I might discuss how this is done in a later post), we receive an initial design that is qualitatively quite similar to the final, locally-optimized design. See the image below.

<img src="/images/physics-impossibility-results/primal-dual-comparison.png" class="insert" style="width: 100%"> *Comparison between the design suggested by the lower bound and the locally-optimized design.*

I highly recommend checking out the pre-print that is up on arXiv for more info. Also, if you spot any mistakes (in either the post or the paper), please do @ me!

[1] | This is... almost accurate, but not quite. It turns out a small modification to the problem is needed for Maxwell's equations in two and three dimensions. For specifics, see the appendix in the paper. |

[2] | Honestly, I only recommend reading this blog with wine (or whatever you have at hand). Not sure it's bearable, otherwise. |

Physics has this nice little law called the second law of thermodynamics, which governs every physical thermodynamical system. The second law is usually phrased as the nice quote "everything tends to disorder," or some other variation thereof, which sounds intuitive but is as opaque as a concrete wall when applied in practice.

As usual, I won't really touch this or other similar discussions with a 10 foot pole (though here is a good place to start), but I'll be giving some thoughts on a similar mathematical principle which arises from statistical systems.

Since this is a short post, I won't be describing Markov chains in detail, but as a refresher, a Markov chain (or Markov process) is a process in which the next state of the process depends only on the current state. In other words, it's a little like the game Snakes and Ladders, where your next position depends only on where you are in the current square and your next dice throw (independent of it being the 1st or 5th time you've landed on the same square).

In particular, we know the probability of being at a new square, \(x\), at time \(t+1\), given that we were in square \(x'\) at time \(t\). In other words, we know \(p(X_{t+1}=x~|~X_{t}=x')\). Similarly, if we weren't sure of our position at time \(t\), but rather had a probability distribution over positions, say \(p(X_t = x)\) (which I will call \(p_t(x)\), for convenience), then the probability of being at position \(x\) at time \(t+1\), given our beliefs about the possible positions \(x'\) at time \(t\), is

\[ p_{t+1}(x) = \sum_{x'} p(X_{t+1}=x~|~X_{t}=x')p_t(x').\tag{1} \]In other words, we just multiply these two probabilities and sum over all possible states \(x'\) at time \(t\). The defining trait of (stationary) Markov processes is that \(p(X_{t+1}=x~|~X_{t}=x')\), which I will call \(K(x, x')\) from now on, is the *same* for all \(t\), and equation (1), now written as

\[ p_{t+1}(x) = \sum_{x'} K(x, x')p_t(x'), \]holds for any \(t\).

It's probably not hard to believe that Markov chains are used absolutely everywhere, since they make for mathematically simple, but surprisingly powerful models of, well, everything.

This is all I will mention about Markov chains, in general. If you'd like a bit of a deeper dive, this blog post is a beautiful (and interactive!) reference. I highly recommend it.

Let \(p_t\) be the distribution of a process at time \(t\), then the second law says that (and I will use statistics notation, rather than physics notation, from here on out!),

\[ H(p_t) \equiv -\sum_x p_t(x) \log p_t(x), \]is non-decreasing as time, \(t\), increases. Note that I'll be dealing with discrete time and space here, but all of these statements with some modifications hold for continuous processes. Anyways, more generally, we can write

\[ H(p_{t+1}) \ge H(p_t), \]but it turns out this law, as stated, doesn't quite hold for many Markov processes. It does, on the other hand, hold for a set of processes where the transition probabilities are symmetric (more generally, this holds iff the transitions are doubly-stochastic. Cover has a slick, few-line proof of this which relies on some properties of the KL-divergence).

In this case, the probability of going from state \(A\) to state \(B\) is the same as the probability of going from state \(B\) to state \(A\). Writing this out mathematically, it says:

\[ K(x, x') = K(x', x). \]I should note that this is a *very* strong condition, but it can be quite useful in giving a simple proof of the above law. To prove this, first note that the KL-divergence is nonnegative, since the negative log is convex, thus by Jensen's inequality (this is the same proof as the previous post):

\[ D(p_t \lVert p_{t'}) = -\sum_x p_t(x) \log \frac{p_{t'}(x)}{p_t(x)} \ge -\log\left(\sum_x p_t(x) \frac{p_{t'}(x)}{p_t(x)}\right) = -\log\left(\sum_x p_{t'}(x)\right) = 0, \]since \(p_{t'}, p_t\) are probability distributions (i.e., nonnegative and sum to one).

Here is the magical trick to proving the above. Note that

\[ D(p_t \lVert p_{t+2}) \ge 0, \]so

\[ -\sum_x p_t(x) \log \frac{p_{t+2}(x)}{p_t(x)} \ge 0, \]which means

\[ -\sum_x p_t(x) \log p_{t+2}(x) \ge -\sum_x p_t(x) \log p_t(x). \]But, by definition, we have that

\[ p_{t+2}(x) = \sum_{x'}K(x, x')p_{t+1}(x'), \]so

\[ -\sum_x p_t(x) \log \left(\sum_{x'} K(x, x')p_{t+1}(x')\right) \ge -\sum_x p_t(x) \log p_t(x), \]but, by Jensen's inequality (again!) on the left hand side we get,

\[ -\sum_x p_t(x) \log \left(\sum_{x'} K(x, x')p_{t+1}(x')\right) \le -\sum_{x, x'} p_t(x)K(x, x') \log p_{t+1}(x'). \]Since we know \(K(x, x') = K(x', x)\), then we immediately have that

\[ \sum_x p_t(x)K(x, x') = p_{t+1}(x'), \]so, putting it all together

\[ \begin{aligned} H(p_{t+1}) &= -\sum_{x'} p_{t+1}(x') \log p_{t+1}(x') \\ &= -\sum_{x, x'} p_t(x)K(x, x') \log p_{t+1}(x')\\ &\ge -\sum_x p_t(x) \log p_t(x) \\ &= H(p_t), \end{aligned} \]which is what we wished to prove.
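The result is easy to check numerically. Below, the kernel is a lazy random walk on a cycle (my own toy example), which is symmetric and doubly stochastic, and the entropy sequence never decreases along the chain:

```python
import numpy as np

n = 6
I = np.eye(n)
S = np.roll(I, 1, axis=0)          # cyclic shift (a permutation matrix)
K = 0.5 * I + 0.25 * (S + S.T)     # symmetric, doubly stochastic: K(x, x') = K(x', x)

def entropy(p):
    p = p[p > 0]                   # use the convention 0*log(0) = 0
    return -np.sum(p * np.log(p))

p = np.zeros(n)
p[0] = 1.0                         # start fully concentrated on one state
H = [entropy(p)]
for _ in range(50):
    p = K @ p                      # p_{t+1}(x) = sum_{x'} K(x, x') p_t(x')
    H.append(entropy(p))
```

After many steps the chain is nearly uniform, so the entropy approaches \(\log 6\).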

So, while it turns out this law doesn't hold for general Markov processes, a very similar law *does* hold. If a Markov process has a stationary distribution, \(p\), then:

\[ D(p_{t+1} \lVert p) \le D(p_t \lVert p), \]so, as the Markov chain continues evolving, the KL divergence between the current distribution and the equilibrium distribution *never increases*.

In fact, more generally, for *any* two initial probability distributions \(p_0, q_0\), we have that

\[ D(p_{t+1} \lVert q_{t+1}) \le D(p_t \lVert q_t), \]so the KL-divergence between any two distributions undergoing the same (Markovian) dynamics never increases! Even if the Markov process does *not* have a unique stationary distribution, there is still a type of second law which holds, in a very general sense.
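This one also lends itself to a quick numerical check, now with an arbitrary (random, made-up) stochastic kernel and two random initial distributions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
K = rng.uniform(size=(n, n))
K /= K.sum(axis=0, keepdims=True)  # columns sum to 1: K(x, x') = p(next = x | now = x')

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = rng.uniform(size=n); p /= p.sum()
q = rng.uniform(size=n); q /= q.sum()

divs = [kl(p, q)]
for _ in range(30):
    p, q = K @ p, K @ q            # evolve both under the same dynamics
    divs.append(kl(p, q))
```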

As before, Cover has a fantastic, slick proof of the above, which I highly recommend you read!

The maximum-likelihood estimator (MLE) is probably the simplest estimator, if you have a probability distribution \(p(x|\theta)\) which models your data. In this case we try to pick the hypothesis \(\theta\) which makes our observed data the most likely. In other words, we want to solve the optimization problem:

\[ \theta^\mathrm{MLE} = \underset{\theta}{\operatorname{argmax}}~~p(x~|~\theta). \]While this framework is quite general, we'll prove that this estimator is consistent in the case where our data points, \(x = \{x^i\}\), are all independently drawn from \(p(\cdot ~|~ \theta^*)\), where \(\theta^*\) is the "true" hypothesis. In other words, when

\[ p(x~|~\theta) = \prod_i p(x^i~|~\theta) \]The proof that this estimator is consistent is relatively simple and assumes only the weak law of large numbers, which says that the empirical mean of a bunch of i.i.d variables \(\{Y_i\}\) converges^{[1]} to its expectation

\[ \frac{1}{n}\sum_{i=1}^n Y_i \overset{p}{\to} \mathbb{E}(Y_1) \](from here on out, I will write 'converges in probability' just as \(\to\), instead of \(\overset{p}{\to}\)).

First, note that^{[2]}

\[ \theta^\mathrm{MLE} = \underset{\theta}{\operatorname{argmax}}~~\log p(x~|~\theta), \]since \(0 < x \le y \iff \log(x) \le \log(y)\) (i.e. \(\log\) is monotonic^{[3]}, so it preserves any maximum or minimum). We also have

\[ \log p(x~|~\theta) = \sum_i \log p(x^i~|~\theta), \]

and now we need some way of comparing the current hypothesis \(\theta\), with the true hypothesis \(\theta^*\). The simplest way is to subtract one from the other and show that the difference is less than zero whenever \(\theta \ne \theta^*\), so this is what we will do.^{[4]} In particular:

\[ \sum_i \log p(x^i~|~\theta) - \sum_i \log p(x^i~|~\theta^*) = \sum_i \log\left(\frac{p(x^i~|~\theta)}{p(x^i~|~\theta^*)}\right). \]

If we can prove the quantity above is negative with high probability, then we're set! So divide by \(n\) on both sides and note that, by the weak law, we have

\[ \frac{1}{n}\sum_i \log\left(\frac{p(x^i~|~\theta)}{p(x^i~|~\theta^*)}\right) \to \mathbb{E}_{X}\left(\log\left(\frac{p(X~|~\theta)}{p(X~|~\theta^*)}\right)\right). \](the expectation here is taken with respect to the true distribution \(p(\cdot ~|~\theta^*)\)). Now, \(\log(x)\) is a concave function (proof: take the second derivative and note that it's always negative), so, this means that

\[ \mathbb{E}(\log(Y)) \le \log(\mathbb{E}(Y)), \]for any random variable \(Y\) (this is Jensen's inequality). In fact, in this case, equality can only happen if \(Y\) takes on a single value, so in general, we have

\[ \mathbb{E}(\log(Y)) < \log(\mathbb{E}(Y)). \]Applying this inequality to the previous line is the only magical part of the proof, which gives us

\[ \begin{aligned} \mathbb{E}_{X}\left(\log\left(\frac{p(X~|~\theta)}{p(X~|~\theta^*)}\right)\right) &< \log\mathbb{E}_{X}\left(\frac{p(X~|~\theta)}{p(X~|~\theta^*)}\right) \\ &= \log \int_S p(X~|~\theta^*)\frac{p(X~|~\theta)}{p(X~|~\theta^*)}~dX\\ &= \log \int_S p(X~|~\theta)~dX\\ &= \log 1\\ &= 0. \end{aligned} \]So, as \(n \uparrow\infty\), we find that

\[ \frac{1}{n}\sum_i \log\left(\frac{p(x^i~|~\theta)}{p(x^i~|~\theta^*)}\right) < 0 \]or, multiplying by \(n\) and rearranging,

\[ \sum_i \log p(x^i~|~\theta) < \sum_i \log p(x^i~|~\theta^*) \]with high probability. So, any point which is not \(\theta^*\) will have a lower likelihood than \(\theta^*\).^{[5]}
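A quick simulation makes this concrete. For a Bernoulli model (the parameters below are made up), the true parameter should win the log-likelihood comparison in essentially every trial once \(n\) is large:

```python
import numpy as np

rng = np.random.default_rng(4)
theta_star, theta, n, trials = 0.3, 0.5, 2000, 200

# draw `trials` datasets of n Bernoulli(theta_star) samples
x = (rng.uniform(size=(trials, n)) < theta_star).astype(float)

def loglik(x, th):
    return np.sum(x * np.log(th) + (1 - x) * np.log(1 - th), axis=1)

frac = np.mean(loglik(x, theta_star) > loglik(x, theta))  # fraction of trials theta* wins
```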

The next question, of course, is how many samples do we need to actually guess the right hypothesis? There are several ways of attacking this question, but let's start with a basic one: what is the probability that a wrong (empirical) likelihood is actually better than the true empirical likelihood? In other words, can we give an upper bound on

\[ P\left(\prod_i p(x^i~|~\theta) \ge \prod_i p(x^i~|~\theta^*)\right) \]that depends on some simple, known quantities? Applying Markov's inequality directly yields the trivial result

\[ P\left(\prod_i \frac{p(x^i~|~\theta)}{p(x^i~|~\theta^*)} \ge 1\right) \le \mathbb{E}_x\left(\prod_i \frac{p(x^i~|~\theta)}{p(x^i~|~\theta^*)}\right) = 1, \]that the probability is at most 1. So where do we go from here? Well, as before, we can turn the product into a sum by taking the log of both sides and dividing by \(n\) (déjà vu, anyone?),

\[ P\left(\prod_i \frac{p(x^i~|~\theta)}{p(x^i~|~\theta^*)} \ge 1\right) = P\left(\frac1n\sum_i \log\left( \frac{p(x^i~|~\theta)}{p(x^i~|~\theta^*)}\right) \ge 0\right). \]Here, we can weaken the question a little bit by asking: how likely is it that our wrong hypothesis has a higher log-likelihood than the right one *by any amount*, \(\varepsilon > 0\). In other words, let's give a bound on

\[ P\left(\frac1n\sum_i \log\left( \frac{p(x^i~|~\theta)}{p(x^i~|~\theta^*)}\right) \ge \varepsilon\right). \]

Here comes a little bit of magic, but this is a general method in what are known as Chernoff bounds. It's a good technique to keep in your toolbox if you haven't quite seen it before!

Anyways, since \(\log\) is a monotonic function, note that \(\exp\) (its inverse) is also monotonic, so,

\[ P\left(\frac{1}{n}\sum_i \log\left(\frac{p(x^i~|~\theta)}{p(x^i~|~\theta^*)}\right) \ge \varepsilon\right) = P\left(\exp\left\{\sum_i \log\left( \frac{p(x^i~|~\theta)}{p(x^i~|~\theta^*)}\right)\right\} \ge e^{n\varepsilon}\right), \]and applying Markov's inequality to the right-hand-side yields

\[ P\left(\exp\left\{\sum_i \log\left(\frac{p(x^i~|~\theta)}{p(x^i~|~\theta^*)}\right)\right\} \ge e^{n\varepsilon}\right) \le e^{-n\varepsilon}, \]so as the number of samples increases, our wrong hypothesis becomes exponentially unlikely to exceed the true hypothesis by more than \(\varepsilon\).
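A Monte Carlo check of this bound (again with a made-up Bernoulli model): the empirical frequency with which the wrong hypothesis beats the true one by more than \(\varepsilon\) should sit below \(e^{-n\varepsilon}\).

```python
import numpy as np

rng = np.random.default_rng(5)
theta_star, theta, n, eps, trials = 0.5, 0.6, 500, 0.01, 5000

x = (rng.uniform(size=(trials, n)) < theta_star).astype(float)
log_ratio = (x * np.log(theta / theta_star)
             + (1 - x) * np.log((1 - theta) / (1 - theta_star)))

emp = np.mean(log_ratio.mean(axis=1) >= eps)  # empirical exceedance frequency
bound = np.exp(-n * eps)                      # the Chernoff-style bound, e^{-n eps}
```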

Of course, at any point in this proof, we could've multiplied both sides of the inequality by \(\lambda > 0\) and everything would've remained true, but note that then we would have a bound

\[ P\left(\frac1n\sum_i \log\left( \frac{p(x^i~|~\theta)}{p(x^i~|~\theta^*)}\right) \ge \varepsilon\right) \le \mathbb{E}_X\left[\left(\frac{p(X ~|~ \theta)}{p(X~|~\theta^*)}\right)^\lambda\right]~ e^{-\lambda n\varepsilon}, \]which looks almost nice, except that we have no control over the tails of

\[ \left(\frac{p(X ~|~ \theta)}{p(X~|~\theta^*)}\right)^\lambda, \]since at no point have we assumed anything about the dependence of \(p\) on \(\theta\) or \(X\) (apart from it being a correct probability distribution). More generally speaking, given \(\lambda \ne 0, 1\), we can find a function such that this quantity blows up (exercise for the reader!), which makes our bound trivial.

It is possible to make some assumptions about how these tails behave, but it's not entirely clear that these assumptions would be natural or useful. If anyone has further thoughts on this, I'd love to hear them!

The second set of lower-bounds that are easy to derive and are surprisingly useful are the Cramér-Rao bounds on estimators. In particular, we can show that, for any estimator \(\hat \theta\) whose expectation is \(\mathbb{E}(\hat \theta) = \psi(\theta)\), with underlying probability distribution \(p(\cdot~|~\theta)\), then^{[6]}

\[ \mathrm{Var}(\hat \theta) \ge \frac{(\psi'(\theta))^2}{I(\theta)}, \]where \(I(\theta) \ge 0\) is the Fisher information of \(p(\cdot~|~\theta)\), which is something like the local curvature of \(\log p(\cdot~|~\theta)\) around \(\theta\). In particular, it is defined as

\[ I(\theta) = -\mathbb{E}_X\left(\frac{\partial^2 \log p(X~|~\theta)}{\partial \theta^2}\right). \]In other words, the inequality says that the flatter \(\log p\) is at \(\theta\), the harder it is to correctly guess the right parameter. This makes sense, since the flatter the distribution is at this point, the harder it is for us to distinguish it from the points around it.
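For a concrete instance, take a Bernoulli(\(\theta\)) model (my running toy example): the defining expectation reduces to the familiar closed form \(I(\theta) = 1/(\theta(1-\theta))\).

```python
# Fisher information of Bernoulli(theta), two ways: the defining
# expectation -E[d^2/dtheta^2 log p(X|theta)] versus 1/(theta(1-theta)).
theta = 0.3

def d2_loglik(x, th):
    # second derivative in th of x*log(th) + (1-x)*log(1-th)
    return -x / th**2 - (1 - x) / (1 - th) ** 2

info_def = -(theta * d2_loglik(1, theta) + (1 - theta) * d2_loglik(0, theta))
info_formula = 1.0 / (theta * (1 - theta))
```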

I'll give a simple proof of this statement soon (in another post, since this one has become quite a bit longer than expected), but, to see why this makes sense, let's go back to the original proof of the consistency of the MLE estimator for a given probability distribution. Note that assuming that \(\psi(\theta) = \theta\) then gives us bounds on the variance of an unbiased estimator of \(\theta\).

At one point we used the fact that

\[ \frac{1}{n}\sum_i \log\left(\frac{p(x^i~|~\theta)}{p(x^i~|~\theta^*)}\right) \to \mathbb{E}_{X}\left(\log\left(\frac{p(X~|~\theta)}{p(X~|~\theta^*)}\right)\right) \equiv -D(\theta ~\Vert~\theta^*) \le 0. \]This quantity on the right is called the KL-divergence, and it has some very nice information-theoretic interpretations, which I highly recommend you read about, but which I will not get into here. Anyways, assuming that \(\theta\) is close to \(\theta^*\), we can do a Taylor expansion around the true parameter \(\theta^*\) to find

\[ \frac{1}{n}\sum_i \log\left(\frac{p(x^i~|~\theta)}{p(x^i~|~\theta^*)}\right) \approx \frac{1}{n}\sum_i \left(\underbrace{\log\left(\frac{p(x^i~|~\theta^*)}{p(x^i~|~\theta^*)}\right)}_{=0} + (\theta - \theta^*)\frac{\partial_\theta p(x^i ~|~\theta^*)}{p(x^i ~|~ \theta^*)} + \dots\right) \]and the quantity on the right hand side goes to, as \(n\uparrow \infty\),

\[ \begin{aligned} \frac{1}{n}\sum_i \left((\theta - \theta^*)\frac{\partial_\theta p(\cdot ~|~\theta^*)}{p(\cdot ~|~ \theta^*)} + O((\theta - \theta^*)^2)\right) &\to (\theta - \theta^*)\mathbb{E}_X\left(\frac{\partial_\theta p(X ~|~\theta^*)}{p(X ~|~ \theta^*)}\right) \\ &= (\theta - \theta^*)\int_S \frac{\partial_\theta p(X~|~\theta^*)}{p(X~|~\theta^*)}p(X~|~\theta^*)~dX\\ &=(\theta - \theta^*)\int_S \partial_\theta p(X~|~\theta^*)~dX\\ &=(\theta - \theta^*)\partial_\theta\left(\int_S p(X~|~\theta^*)~dX\right)\\ &= (\theta - \theta^*)\partial_\theta (1)\\ &= 0 \end{aligned} \]...zero?!^{[7]} Well, the expectation of the first derivative of the log-likelihood vanishes, so taking the second term in the Taylor expansion yields
\[ \frac{1}{n}\sum_i \frac{(\theta - \theta^*)^2}{2}\,\frac{\partial^2 \log p(x^i~|~\theta^*)}{\partial \theta^2} \to \frac{(\theta - \theta^*)^2}{2}\,\mathbb{E}_X\left(\frac{\partial^2 \log p(X~|~\theta^*)}{\partial \theta^2}\right) = -\frac{(\theta - \theta^*)^2}{2}\, I(\theta^*), \]by the definition of the Fisher information above.

Putting it all together, we have that

\[ \frac{1}{n}\sum_i \log\left(\frac{p(x^i~|~\theta)}{p(x^i~|~\theta^*)}\right) \to -D(\theta \Vert \theta^*) \approx -\frac12(\theta - \theta^*)^2I(\theta^*) + O((\theta - \theta^*)^3), \]or that the curvature of the KL-divergence around \(\theta^*\) is the Fisher information! In this case, it shouldn't be entirely surprising that there is some connection between how well we can measure a parameter's value and the Fisher information of that parameter, since the local curvature of the log-likelihood around that parameter is given, in expectation, by the Fisher information.
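A quick numerical check of this curvature statement (again a hypothetical Gaussian example of my own choosing, where \(I(\theta^*) = 1\)): a Monte Carlo estimate of \(D(\theta~\Vert~\theta^*)\), using the definition above, should be close to \(\frac12(\theta - \theta^*)^2 I(\theta^*)\):

```python
import numpy as np

rng = np.random.default_rng(1)

def logp(x, theta):
    # unnormalized log-density of N(theta, 1); constants cancel in the ratio
    return -0.5 * (x - theta) ** 2

theta_star, dtheta = 1.0, 0.2
x = rng.normal(theta_star, 1.0, size=500_000)  # samples from p(. | theta_star)

# Monte Carlo estimate of D(theta || theta*) = E_{theta*}[log(p(X|theta*)/p(X|theta))]
kl_mc = (logp(x, theta_star) - logp(x, theta_star + dtheta)).mean()

# the quadratic approximation (1/2)(theta - theta*)^2 I(theta*), with I = 1 here
kl_quad = 0.5 * dtheta**2

print(round(kl_mc, 4), kl_quad)
```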

On the other hand, I haven't been able to find a direct proof of the above bound (or, even, any other nice bounds) given *only* the above observation. So, while the connection might make sense, it turns out the proof of the Cramér-Rao bound uses a slightly different technique, which I will present later (along with some other fun results!).

[1] | In probability. I.e., the probability that the empirical mean differs from the expectation by some amount, \(\left|\frac{1}{n}\sum_i Y_i - \mathbb{E}[Y_1]\right| > \varepsilon\), goes to zero as \(n\uparrow \infty\). A simple proof in the finite-variance case follows from Chebyshev's inequality (exercise for the reader!). |

[2] | It's often easier to deal with log-probabilities than it is to deal with probabilities, so this trick is relatively common. |

[3] | In fact, the logarithm is strictly monotonic, so it preserves minima uniquely. In other words, for any function \(\phi: S\to \mathbb{R}^{> 0}\), \(\phi\) and \(\log \circ\, \phi\) have minima and maxima at exactly the same points. |

[4] | I am, of course, being sneaky: the subtraction happens to work since this just happens to yield the KL-divergence in expectation—but that's how it goes. Additionally, the requirement really is not that \(\theta \ne \theta^*\), but rather that \(p(x~|~\theta^*) \ne p(x~|~\theta)\), just in case there happen to be multiple hypotheses with equivalent distributions. Since you're reading this then just assume throughout that \(p(\cdot~|~\theta) \ne p(\cdot~|~\theta^*)\) on some set with nonzero probability (in the base distribution) whenever \(\theta \ne \theta^*\). |

[5] | While it may seem that there should be easy bounds to give immediately based on this proof, the problem is that we do not have good control of the second moment of \(\log(p(\cdot~|~\theta^*)/p(\cdot~|~\theta))\) (this quantity may not even converge in a nice way). This makes giving any kind of convergence rate quite difficult, since the proof of the weak law given only a first-moment guarantee uses the dominated convergence theorem to give non-constructive bounds. |

[6] | This is a derivation of the one-dimensional case, but the \(n\)-dimensional case is almost identical. |

[7] | All I will say about changing the derivative and integral around is that it is well-justified by dominated convergence. |

Most people know Principal Component Analysis (PCA) as a fast, easily-scalable dimensionality-reduction technique used quite frequently in machine learning and data exploration—in fact, it's often mentioned that a one-layer, linear neural network^{[1]} applied to some data set recovers the result of PCA.

It's (also) often mentioned that PCA is one of the few non-convex problems that we can solve efficiently, though a (let's say 'non-constructive') answer showing this problem is convex is given in this Stats.SE thread, which requires knowing the eigenvectors of \(X^TX\), a priori. It turns out it's possible to create a fairly natural semi-definite program which actually constructs the solution in its entirety.

Since I'll only give a short overview of the topic of PCA itself, I won't go too much into depth on methods of solving this. But, the general idea of PCA is to find the best low-rank approximation of a given matrix \(A\). In other words, we want, for some given \(k\):

\[ \begin{aligned} & \underset{X}{\text{minimize}} & & \| A - X \|_F^2 \\ & \text{subject to} & & \text{rank}(X) = k, \end{aligned} \]where \(\| B \|_F^2\) is the square of the Frobenius norm of \(B\) (i.e. it is the sum of the squares of each entry of \(B\)). Why is this useful? Well, in the general formulation, we can write the SVD decomposition of some optimal \(X^* \in \mathbb{R}^{m\times n}\),

\[ X^* = U^*\Sigma^* (V^*)^T \]with orthogonal \(U^* \in \mathbb{R}^{m\times k}, V^*\in \mathbb{R}^{n\times k}\) and diagonal \(\Sigma^* \in \mathbb{R}^{k\times k}\). Then the columns of \(V^*\) represent the \(k\) most important features of \(A\) (assuming that each row of \(A\) is a point of the dataset). This may seem slightly redundant if you already know the punchline, but we'll get there in a second.

For now, define the SVD of \(A\) in a similar way to the above

\[ A = U\Sigma V^T, \]with orthogonal \(U, V\) and diagonal \(\Sigma\).

For convenience, it's easiest to define the diagonal of \(\Sigma\) (the singular values of \(A\)) to be sorted with the top-left value being the largest and bottom-right value being the smallest. Then let \(U_k\) be the matrix which contains only the first \(k\) columns of \(U\) (and similarly for \(V_k\)), while \(\Sigma_k\) is the \(k\) by \(k\) diagonal sub-matrix of \(\Sigma\) containing only the first \(k\) values of the diagonal (as usual, starting from the top left).

Now we can get to the punchline I was talking about earlier: it turns out that the SVD of \(X^*\) is the *truncated* SVD of \(A\), in other words, if the SVD of \(A\) is \(U\Sigma V^T\), then the optimal solution is
\[ X^* = U_k \Sigma_k V_k^T. \]

This is the usual way of computing the PCA decomposition of \(A\): simply take the SVD and then look at the first \(k\) columns of \(V\).^{[2]} We'll make use of this fact to show that the optimal values are equal, but it won't be necessary to actually *compute* the result.
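As a quick sanity check of this punchline (a sketch with made-up data; the optimality statement is the Eckart–Young theorem), the truncated SVD does at least as well as any other rank-\(k\) candidate, and its error is exactly the sum of the squared discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
X_star = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # truncated SVD of A

best = np.linalg.norm(A - X_star, "fro") ** 2
for _ in range(200):
    # random rank-k candidates should never beat the truncated SVD
    B = rng.normal(size=(8, k)) @ rng.normal(size=(k, 5))
    assert np.linalg.norm(A - B, "fro") ** 2 >= best

# the optimal error is the sum of the squared discarded singular values
print(np.isclose(best, (s[k:] ** 2).sum()))  # True
```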

In general, semi-definite programs (i.e. optimization over symmetric, positive-semidefinite matrices with convex objectives and constraints) are convex problems. Here, we'll construct a (relatively) simple reduction of the non-convex problem of PCA, as presented above, to the SDP.

This entire idea was interesting to me, since it was mentioned in this lecture which was a result I didn't know about. There aren't any complete proofs of this online, other than a quick mention in Vu, et al. (2013), though it's not hard to show the final result given the general ideas. I highly encourage you to try the proof out after reading only the main ideas, if you're interested!

First, we'll start with the usual program, call it *program Y*:
\[ \begin{aligned} & \underset{X}{\text{minimize}} & & \| A - X \|_F^2 \\ & \text{subject to} & & \text{rank}(X) = k, \end{aligned} \]

and construct the equivalent program (this step can be skipped with a cute trick below), with \(F = A^TA\),

\[ \begin{aligned} & \underset{P}{\text{minimize}} & & \| F - P \|_F^2 \\ & \text{subject to} & & \text{rank}(P) = k,\\ &&& P^2 = P,\\ &&& P^T = P, \end{aligned} \]in other words, this is a program over projection matrices \(P\). This can then be put into the form

\[ \begin{aligned} & \underset{P}{\text{maximize}} & & \text{tr}(FP) \\ & \text{subject to} & & \text{rank}(P) = k,\\ &&& P^2 = P,\\ &&& P^T = P. \end{aligned} \]for some matrix \(F\), and it can be relaxed into the following SDP, let's call it *problem Z*,
\[ \begin{aligned} & \underset{P}{\text{maximize}} & & \text{tr}(FP) \\ & \text{subject to} & & \text{tr}(P) = k,\\ &&& 0 \preceq P \preceq I, \end{aligned} \]

where \(A \preceq B\) is an inequality with respect to the semi-definite cone (i.e. \(A \preceq B \iff B - A\) is positive semi-definite). You can then show that this SDP has *zero integrality gap* with the above program over the projection matrices. More specifically, any solution to the relaxation can be easily turned into a solution of the original program.

Just a random side-note: if you took or followed Stanford's EE364A course for the previous quarter (Winter 2018), the latter part of this proof idea may seem familiar—it was a problem written for the final exam. My original intent with it was to guide the students through the complete proof, but better judgement prevailed and the question was cut down to only that last part with some hints.

The two interesting points of the whole proof are (a) to realize that any solution of the original problem (program Y) can be written as a solution \(X = AP'\) for some projection matrix \(P'\) (which, of course, will turn out to be the projection matrix \(P\) which solves program Z, namely \(P' = V_kV_k^T\)), and (b) to note that we can prove that program Z has zero integrality gap since, if we have a solution to the SDP given by \(P^* = UDU^T\), then we can 'fix' non-integral eigenvalues via solving the problem

\[ \begin{aligned} & \underset{x}{\text{maximize}} & & c^Tx \\ & \text{subject to} & & 1^Tx = k,\\ &&& 0\le x_i \le 1, ~~\forall i, \end{aligned} \]where \(c_i = (U^TFU)_{ii}\). This LP has an integral solution \(x^*\) (what should this solution be?) which preserves the objective value of the original problem, so \(\bar P^* = U\text{diag}(x^*)U^T\) is a feasible, integral solution to the original problem, with the same objective value as before, so the SDP relaxation is tight!
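Here's a small numerical sketch of that last step (with random data of my own choosing): the LP above is maximized by putting weight one on the \(k\) largest entries of \(c\), and no feasible fractional point does better:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3
c = rng.normal(size=n)

# candidate integral solution: the indicator of the k largest entries of c
x_int = np.zeros(n)
x_int[np.argsort(c)[-k:]] = 1.0

# sample feasible fractional points (0 <= x <= 1, sum(x) = k) and check that
# none of them attains a larger objective value
for _ in range(2000):
    x = rng.uniform(0.3, 0.7, size=n)
    x *= k / x.sum()          # rescale so that sum(x) = k
    if x.max() > 1.0:         # the rescaling can leave the box; skip those draws
        continue
    assert c @ x <= c @ x_int + 1e-12

print("no sampled fractional point beats the integral solution")
```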

Using all of this, then we've converted the PCA problem into a purely convex one, without computing the actual solution beforehand.

[1] | More specifically, a one-layer, linear NN with \(\ell_2\) loss. |

[2] | As usual, there are smarter ways of doing this. It turns out one can run a truncated or partial SVD decomposition, which doesn't require constructing all singular values and all the columns of \(U, V\). This is far more efficient whenever \(k\ll \min\{m, n\}\), where \(m,n\) are the dimensions of the data. This latter condition is usually the case for practical purposes. |

For more background on the general optimization case, check out the posts above. In general, I won't be using much content from the previous three posts, so they're not necessary reading other than for context.

First, as before, it's reasonable to ask for a fast heuristic to discover paths on a graph which are 'feasible' in a weak sense (i.e. a path where the UAV does not crash against an obstacle whose trajectory is known). This solution is then relaxed into a continuous problem in \(\mathbb{R}^2\) and then optimized over. This latter trajectory is the one actually fed to the UAV controller and executed by the UAV. It should be noted that performing this weird relaxation is useful since, without a good initialization, it often takes quite a while for the algorithm to begin to converge to a feasible solution (and it can often run into numerical stability problems while trying to do so). Anyways, for more of the details, check out the first and second posts (and the videos at the bottom, to observe the qualitative behavior).

I should mention that, unlike the previous problems, there already exist some fun results on this notion of the shortest path (including a poly-time \((1+\varepsilon)OPT\) approximation!), but it's interesting enough to describe in a quick post anyways. In general, I assume no constraints on the possible curvature of a trajectory for this approximation, though it's straightforward to include them in the general problem if the hit on run-time performance isn't an issue.

The problem set up is the following: let's say we have a family of graphs \(G_t\subseteq G\) parametrized by some time parameter \(t\in \mathbb{R}^{\ge 0}\), where \(G\) is the 'universal graph'; in other words, every graph at every point in time is a subset of both edges and vertices of that graph. One idea for constructing this \(G\) is to set \(G = \bigcup_{t\in \mathbb{R}^{\ge 0}} G_t\) and insist that \(G\) be a finite graph.^{[1]}

\(G_t\) at each point encodes some constraints on the current position of the drone, which is indicated by some vertex \(v\in V(G)\) (where \(V(G)\) is the vertex set of \(G\)) and this position is *valid* at time \(t\) if the vertex exists in \(G_t\). In other words, position \(v\) is valid at \(t\) if \(v\in V(G_t)\).

Now, the question becomes: given some cost function \(c: V(G)\times V(G) \to \mathbb{R}^{\ge 0}\) and some start and end nodes, construct a shortest valid path^{[2]} from the start to the end nodes (where the start node is assumed to be at time \(t=0\)), if it exists.

With this definition and knowledge of the \(A^*\) algorithm (hint, hint!), I encourage working out what the solution to this problem is, assuming we have a consistent heuristic for the path.

As a side note: a simple heuristic, which usually works quite well, is to take the \(\ell^2\) distance between two nodes and divide it by the maximum velocity of the UAV—this is consistent since the UAV cannot travel between two points faster than being at its maximum velocity along the shortest possible line. In cases where many of the obstacles are small relative to the size of the graph and are sparse, this idea works extremely well because the approximation is fairly tight.

With that in mind, here's the algorithm, which is really just a (slightly) modified version of \(A^*\). (The code below is like quasi-Python pseudocode, but implementing directly shouldn't require too many changes. Additionally, some things can be easily stored instead of recomputed by exploiting the structure of the cost function.)

```
:::python
q <- priority queue
start_node <- start node
end_node <- end node
c <- edge-cost function
h <- heuristic cost function
G <- graph at time t

# Algorithm begins here
add ([start_node], cost=0) to q
while q is not empty:
    curr_path, curr_cost = q.pop_smallest()
    last_node = curr_path[end]
    if last_node is end_node:
        return curr_path
    for neighbor in last_node.neighbors:
        new_path = curr_path.append(neighbor)
        new_cost = c(new_path)
        if neighbor not in G(new_cost):
            continue
        add (new_path, cost=(new_cost + h(neighbor, end_node))) to q
return None
```

This algorithm returns one of the optimal paths, since a path will only be returned if the total cost of the found path is at most as large as the next possible valid path to some point \(v_t\) plus the heuristic cost \(h(v_t, e)\), where \(e\) is the end node. By assumption, the heuristic function is a global underestimator, which immediately implies that the returned path must have had the minimum possible original cost. I should also point out that there's nothing preventing an exponential-time solution (and it's certainly exponential in the worst case... if there doesn't exist a path between the start and end nodes, for example)! This is not great, but (as usual) this algorithm works much better than exponential time, in practice.

Another thing to be careful of is that the above algorithm can also return paths which double back on themselves (e.g. if the UAV needs to 'wait' for an obstacle to pass). This may not be desired behavior (at least, definitely not in our case), so specific checks can be added to prevent this, depending on the application. Additionally, there is nothing restricting the cost function to be time-independent, so even this constraint can be relaxed while still maintaining optimality.
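Since the pseudocode above leaves the data structures implicit, here's a tiny runnable sketch of the same idea (the interface is my own invention: `neighbors`, `cost`, `h`, and `valid_at` stand in for the cost function and the membership test \(v \in V(G_t)\); none of the names come from the original code):

```python
import heapq

def time_dependent_a_star(neighbors, cost, h, valid_at, start, goal):
    """A* over a time-varying graph: valid_at(v, t) plays the role of
    the membership test v in V(G_t)."""
    # heap entries: (priority, arrival_time, node, path)
    q = [(h(start, goal), 0.0, start, [start])]
    while q:
        _, t, node, path = heapq.heappop(q)
        if node == goal:
            return path, t
        for nbr in neighbors(node):
            t_new = t + cost(node, nbr)
            if not valid_at(nbr, t_new):  # skip nodes invalid at arrival time
                continue
            heapq.heappush(q, (t_new + h(nbr, goal), t_new, nbr, path + [nbr]))
    return None  # queue exhausted without reaching the goal

# toy example: the line graph 0-1-2-3 with unit edge costs, where node 2 is
# blocked before t = 2.5 (say, by a passing obstacle)
nbrs = lambda v: [u for u in (v - 1, v + 1) if 0 <= u <= 3]
cost = lambda u, v: 1.0
h = lambda u, v: abs(u - v)  # consistent: straight-line distance at unit speed
valid = lambda v, t: not (v == 2 and t < 2.5)

path, arrival = time_dependent_a_star(nbrs, cost, h, valid, 0, 3)
print(path, arrival)
```

Note that the returned path here is `[0, 1, 0, 1, 2, 3]`: the search 'waits' for the obstacle by doubling back, which is exactly the behavior discussed above.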

Anyways, that's all for today. I wanted to keep this post (relatively) short and sweet since there's another one coming up quite soon on how to perform the optimization found in the previous posts: given a functional form for the position of the obstacles at some point in time—e.g., the next step after finding the approximation. Hopefully there will be some more time next week to write that out, but I make no major promises.

[1] | By the 'union' of graphs, I mean that the new graph should be
\[ \bigcup_{t} G_t = \left(\bigcup_{t} V(G_t), ~ \bigcup_t E(G_t)\right) \]
where \(V(G)\) is the set of vertices of \(G\) and \(E(G)\) is the set of edges of \(G\). |

[2] | A path \(v = (v_t)\) is valid if it is a path from the start node to the end node and each \(v_t \in V(G_t)\) for every possible \(t\). We also have that \(t_{i+1} - t_i = c(v_{t_i}, v_{t_{i+1}})\). In other words, the time at action number \(i\) is the sum of the costs of all of the previous actions (this just gives a definition of 'time' in this problem). |

This led to a rabbit hole of asking what loss functions should be included in the optimization package (Huber, square, \(\ell_1\), etc.), many of which are relatively straightforward to implement—except, of course, SVM. Almost all of the other cases pose few problems, since computing their proximal gradient is a direct computation (e.g. \(\ell_1\) corresponds immediately to a shrinkage operator, as we'll see below). The SVM loss, however, has a set of hard constraints which I haven't found a nice way of stuffing into the prox-gradient step (and I suspect that there are no such nice ways, but I'd love to be proven wrong here); thus, every step requires finding a projection onto a polygon, which is, itself, a second optimization problem that has to be solved.

Prox gradients are (generally) really well-behaved and I've been having some fun trying to really understand how well they work as general optimizers—I write a few of those thoughts below along with an odd solution to the original problem.

Proximal gradients are a nice idea emerging from convex analysis which provide useful ways of dealing with tricky, non-differentiable convex functions. In particular, you can think of the proximal gradient of a given function as an optimization problem that penalizes taking steps "too far" in a given direction. Better yet (and perhaps one of the main useful points) is that most functions we care about have relatively nice proximal gradient operators!

Anyways, for now, let's define the proximal gradient of a function \(g\) at some point \(x\) (usually denoted \(\text{prox}_g(x)\), though I will simply call it \(P_g(x)\) for shorter notation) to be

\[ P_g(x) \equiv \mathop{\arg\!\min}\limits_y \left(g(y) + \frac{1}{2}\lVert x- y\lVert_2^2\right) \]The definition is useful because, if we let \(\partial g(u)\) be the subdifferential of \(g\) at \(u\), then optimality guarantees that, if \(y = P_g(x)\), then (since the subdifferential of the objective must contain zero at the optimum) we have

\[ x - y \in \partial g(y). \]In other words, \(0 \in \partial g(y)\) iff \(y\) is a fixed point of \(P_g\)—that is, we have reached a minimizer of \(g\) iff

\[ y = P_g(y). \]Additionally, there's no weird trickery that has to be done with subdifferentials since the result of \(P_g\) is always unique, which is a nice side-benefit. Using just this, we can already begin to do some optimization. For example, let's consider the (somewhat boring, but enlightening) example of minimization of the \(\ell_1\)-norm. Using the fact that

\[ u = P_{\lambda |\cdot|}(x) \iff x - u \in \partial |u| \]and using the fact that the \(\ell_1\)-norm is separable, we have, whenever \(u>0\) (I'm considering a single term of the sum, here)

\[ x - u = \lambda \implies u = x-\lambda\text{ whenever } x-\lambda > 0 \]similarly for the \(u=0\) case we have (where \(\lambda S\) for some set \(S\) is just multiplication of every element in the set by \(\lambda\))

\[ x - u = x \in \lambda [-1, 1] = [-\lambda, \lambda]. \]that is

\[ u = 0\text{ whenever } |x|\le \lambda \]and similarly for the \(u<0\) case, we have \(u = x + \lambda\) if \(x < -\lambda\). Since this is done for each component, the final operator has action

\[ u_i = \begin{cases} x_i - \lambda, & x_i > \lambda\\ 0, & |x_i| \le \lambda \\ x_i + \lambda, & x_i < -\lambda. \end{cases} \]This operator is called the 'shrinkage' operator because of its action on its input: entries with \(|x_i|\) greater than our given \(\lambda\) are shrunk toward zero by that amount, while the remaining entries are set to zero. Note, then, that successively applying (in the same manner as SGD) the update rule

\[ u^{i+1} = P_{|\cdot|}(u^i) \]correctly yields the minimum of the given convex function, i.e. 0. Of course, this isn't particularly surprising since we already know how to optimize the \(\ell_1\)-norm function, \(\lVert x \lVert_1\) (just set it to zero!), but it will help out quite a bit when considering more complicated functions.
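Since the shrinkage operator is about the simplest prox operator there is, it's also easy to sanity-check numerically; the snippet below (my own sketch) compares the closed form above against a brute-force grid search over the prox objective:

```python
import numpy as np

def shrink(x, lam):
    """Soft-thresholding: the prox operator of lam * |.|, applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# brute-force check: minimize lam * |y| + 0.5 (x - y)^2 over a fine grid
lam, x = 1.0, 1.7
ys = np.linspace(-5.0, 5.0, 200_001)
brute = ys[np.argmin(lam * np.abs(ys) + 0.5 * (x - ys) ** 2)]

print(shrink(np.array([x]), lam)[0], brute)  # both close to x - lam = 0.7
```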

Now, given a problem of the form

\[ \min_x f(x) + g(x) \]where \(f\) is differentiable, and \(g\) is convex, we can write down a possible update in the same vein as the above, except that we now also update our objective for \(f\) *and* \(g\) at the same time
\[ x^{i+1} = P_{\gamma^i g}\left(x^i - \gamma^i \nabla f(x^i)\right). \]

Here, \(\gamma^i\) is defined to be the step size for step \(i\). It turns out we can prove several things about this update, but, perhaps most importantly, we can show that it works.

Anyways, this is all I'll say about the proximal gradient step as there are several good resources on the proximal gradient method around which will do a much better job of explaining it than I probably ever will: see this for example.
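To see the full \(f + g\) update in action, here's a small sketch (with random data and a lasso-style objective of my own choosing, not anything from above): \(f(x) = \frac12\lVert Ax - b\lVert_2^2\) is the differentiable part, and \(g = \lambda\lVert \cdot \lVert_1\) is handled by the shrinkage operator:

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = 0.5 ||Ax - b||^2 (differentiable), g(x) = lam ||x||_1 (convex)
A = rng.normal(size=(30, 10))
b = rng.normal(size=30)
lam = 1.0

def shrink(x, t):
    # prox operator of t * ||.||_1 (the shrinkage operator from above)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L, with L the Lipschitz constant of grad f
x = np.zeros(10)
for _ in range(2000):
    grad = A.T @ (A @ x - b)                 # gradient step on f...
    x = shrink(x - step * grad, step * lam)  # ...followed by a prox step on g

obj = 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()
print(round(obj, 4))  # should be well below the objective at x = 0
```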

I assume some familiarity with SVMs, but the program given might require a bit of explanation. The idea of an SVM is that of a soft-margin classifier (there are hard-margin SVMs, but we'll consider the former variety for now): we penalize the error of being on the wrong side of the decision boundary linearly (with zero penalty for being on the correct side). Additionally, we penalize the size of the margin itself so that the classifier doesn't depend overly on any particular variable (i.e. as a form of regularization).

The usual quadratic program for an SVM is, where \(\xi^\pm_i\) are the slack variables indicating how much the given margin is violated, \(\varepsilon > 0\) is some arbitrary positive constant, and \(\mu\) is the hyperplane and constant offset found by the SVM (e.g. by allowing the first feature of a positive/negative sample to be \((x^\pm_i)_0 = 1\)):

\[ \begin{aligned} & \underset{\xi, \mu}{\text{minimize}} & & \sum_i \xi^+_i + \sum_i \xi^-_i + C\lVert \mu \lVert_2 \\ & \text{subject to} & & \mu^Tx^+_i - \varepsilon \ge -\xi^+_i,\,\,\text{ for all } i \\ &&& \mu^Tx^-_i + \varepsilon \le \xi^-_i,\,\,\text{ for all } i\\ &&& \xi^\pm_i \ge 0,\,\,\text{ for all } i \end{aligned} \]we can rewrite this immediately, given that the class of data point \(i\) is \(y^{i}\in \{-1, +1\}\), to the much nicer form

\[ \begin{aligned} & \underset{\xi, \mu}{\text{minimize}} & & 1^T \xi + C\lVert \mu \lVert_2 \\ & \text{subject to} & & -y^i \mu^Tx_i - \varepsilon \ge -\xi_i,\,\,\text{ for all } i \\ &&& \xi_i \ge 0,\,\,\text{ for all } i \end{aligned} \]and noting that the objective is homogeneous of degree one, we can just multiply the constraints and all variables by \(\frac{1}{\varepsilon}\) which yields the immediate result (after flipping some signs and inequalities)

\[ \begin{aligned} & \underset{\xi, \mu}{\text{minimize}} & & 1^T\xi + C\lVert \mu \lVert_2 \\ & \text{subject to} & & \xi_i\ge y^i \mu^Tx_i +1,\,\,\text{ for all } i \\ &&& \xi_i \ge 0,\,\,\text{ for all } i \end{aligned} \]which, after changing the \(\ell_2\) regularizer to an \(\ell_2^2\)-norm regularizer (which is equivalent for an appropriate choice of the regularization hyperparameter, which we'll keep calling \(C\)^{[1]}) yields

\[ \begin{aligned} & \underset{\xi, \mu}{\text{minimize}} & & 1^T\xi + C\lVert \mu \lVert_2^2 \\ & \text{subject to} & & \xi_i\ge y^i \mu^Tx_i +1,\,\,\text{ for all } i \\ &&& \xi_i \ge 0,\,\,\text{ for all } i \end{aligned} \]

This is the final program we care about and the one we have to solve using our proximal gradient operator. In general, it's not obvious how to fit the inequalities into a step, so we have to define a few more things.

For now, let's define the set indicator function

\[ g_S(x) = \begin{cases} 0 & x\in S\\ +\infty & x\not\in S \end{cases} \]which is convex whenever \(S\) is convex; we can use this to encode the above constraints (I drop the \(S\) for convenience in what follows) such that a program equivalent to the above is

\[ \underset{\xi, \mu}{\text{minimize}} \,\, 1^T \xi + C\lVert \mu \lVert_2^2 + g(\mu, \xi) \]which is exactly what we wanted! Why? Well:

\[ \underset{\xi, \mu}{\text{minimize}}\,\, \underbrace{1^T \xi + C\lVert \mu \lVert_2^2}_\text{differentiable} + \underbrace{g(\mu, \xi)}_\text{convex} \]so now, we just need to find the proximal gradient operator for \(g(x)\) (which is not as nice as one would immediately think, but it's not bad!).

Now, let's generalize the problem a bit: we're tasked with the question of finding the prox-gradient of \(g_S(x)\) such that \(S\) is given by some set of inequalities \(S = \{x\,|\, Ax\le b\}\) for some given \(A, b\).^{[2]} That is, we require
\[ P_{\lambda g_S}(x) = \mathop{\arg\!\min}\limits_y \left(g_S(y) + \frac{1}{2\lambda}\lVert x- y\lVert_2^2\right), \]

which can be rewritten as the equivalent program (where the \(1/2\lambda\) is dropped since it's just a proportionality constant)

\[ \begin{aligned} & \underset{y}{\text{minimize}} & & \lVert x- y\lVert_2^2 \\ & \text{subject to} & & Ay\le b \end{aligned} \]it turns out this program isn't nicely solvable using the prox-gradient operator (since there's no obvious way of projecting onto \(Ax\le b\) and *also* minimizing the quadratic objective). But, of course, I wouldn't be writing this if there wasn't a cute trick or two we could do: note that this program has a strong dual (i.e. the values of the dual program and the primal are equal) by Slater's condition, so how about trying to solve the dual program? The lagrangian is
\[ \mathcal{L}(y, \eta) = \frac{1}{2}\lVert x- y\lVert_2^2 + \eta^T(Ay - b), \]with dual variable \(\eta \ge 0\),

from which we can derive the dual by taking derivatives over \(y\):

\[ \nabla_y\mathcal{L} = y-x + A^T\eta = 0 \implies y = x - A^T\eta \]and plugging in the above (and simplifying) yields the program

\[ \begin{aligned} & \underset{\eta}{\text{maximize}} & & \eta^T(Ax-b) - \frac{1}{2}\lVert A^T\eta\lVert_2^2 \\ & \text{subject to} & & \eta \ge 0 \end{aligned} \]from which we can reconstruct the original solution by the above, given:

\[ y = x - A^T\eta. \]This program now has a nice prox step, since \(\left(P_{\eta \ge 0}(\eta)\right)_i = \max\{0, \eta_i\}\) (the 'positive part' of \(\eta_i\), in other words). Verifying this prox step is left as an exercise for the reader.
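Here's a sketch of that projection step on a tiny, made-up polygon (the polygon, step size, and iteration count are all arbitrary choices of mine): run projected gradient ascent on the dual above, clipping \(\eta\) at zero each step, and recover \(y = x - A^T\eta\):

```python
import numpy as np

# project x onto {y : Ay <= b} for a small, hypothetical polygon
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([1.0, 1.0, 1.0])
x = np.array([2.0, 2.0])

eta = np.zeros(3)
step = 1.0 / np.linalg.norm(A @ A.T, 2)  # safe step size for this quadratic
for _ in range(5000):
    grad = (A @ x - b) - A @ (A.T @ eta)      # gradient of the dual objective
    eta = np.maximum(eta + step * grad, 0.0)  # ascent step, then prox onto eta >= 0

y = x - A.T @ eta  # recover the primal projection
print(np.round(y, 4), bool(np.all(A @ y <= b + 1e-6)))
```

For this toy polygon the projection of \((2, 2)\) lands at \((1, 1)\), with the first two constraints active.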

Putting the above together yields a complete way of optimizing the SVM program: first, take a single step on the initial objective; then find the projection onto the polygon given by the inequality constraints using the second (dual) method; and then take a new step on the original program, repeating until convergence.

Of course, this serves as little more than an academic exercise (I'm not sure how thrilled either Boyd or Lall would be at using dual programs in 104, even in a quasi-black-box optimizer), but it may be of interest to people who take interest in relatively uninteresting things (e.g. me) or to people who happen to have *really freaking fast* proximal gradient optimizers (not sure these exist, but we can pretend).

[1] | We're also finding this by using cross-validation, rather than a-priori, so it doesn't matter too much. |

[2] | Note that this is equivalent to the original problem, actually. The forwards direction is an immediate generalization. The backwards direction is a little more difficult, but can be made simple by considering, say, only positive examples. |

Anyways, we left off on the idea that we now have a function which we wish to optimize, along with a sequence of constants \(C\) which tends to a given solution—in particular, and perhaps most importantly, we only care about the solution in the limit \(C\to\infty\).

As before, though, we can't just optimize the function

\[ \mathcal{L}(x; c, R, C) = \sum_{i}\left[\sum_j\phi\left(C\left(\frac{\lVert x_i - c_j \lVert_2^2}{R_j^2} - 1\right)\right) + \eta \lVert x_i - x_{i+1}\lVert_2^2\right] \]over some large \(C\), since we've noted that the objective becomes almost-everywhere flat in the limit^{[1]} and thus becomes very difficult to optimize using typical methods. Instead, we optimize over the sequence of functions, for \(C_k \to \infty\),
\[ x^{(k)} = \mathop{\arg\!\min}\limits_x \mathcal{L}(x; c, R, C_k), \]

picking only

\[ x^* = \lim_{k\to\infty} x^{(k)} \]as our final trajectory. The goal of this post is to explore some methods for optimizing this function in the constrained, embedded environment which we'll be using for the competition.

I'll do a quick introduction to gradient descent since there are plenty of posts on this method, many of which I suspect are much better than anything I'll ever write.

Anyways, the simple idea (or another perspective on it) is that, if we want to find the minimum of a function \(V\), then we can think about the function as a potential for a particle, whose position we call \(x\), and then just run Newton's equation forward in time!

\[ m\ddot x = -\nabla V(x) \]where \(\ddot x = \frac{d^2x}{dt^2}\) (this is just a rewriting of \(F=ma=m\ddot x\), where our force is conservative). You may notice a problem with this idea: well, if we land in a well, we'll continue oscillating... that is, there's literally no friction to stop us from just continuing past the minimum. So, let's add this in as a force proportional to the velocity (but pointing in the opposite direction), with friction coefficient \(\mu>0\):

\[ m\ddot x = -\nabla V(x) - \mu \dot x. \]Now, here I'll note we can do two things: one, we can keep the former term containing acceleration (i.e. momentum), accepting that we could possibly overshoot our minimum (because, say, we're going 'too fast') but then later 'come back' to it (this is known as gradient descent with momentum),^{[2]} or, if we never want to overshoot it (but allow for the possibility that we may always be too slow in getting there in the first place) we can just send our momentum term to zero by sending \(m \to 0\). I'll take the latter approach for now, but we'll consider the former case, soon.

Anyways, sending \(m\to 0\) corresponds to having a ball slowly rolling down an extremely sticky hill, stopping only at a local minimum, that is:

\[ \mu\dot x + \nabla V(x) = 0 \]or, in other words:

\[ \dot x = -\frac{1}{\mu}\nabla V(x). \]Discretizing this equation by noting that, by definition of the derivative, we have

\[ \dot x(t_{i+1}) \approx \frac{x_{i+1} - x_i}{h} \]then gives us (by plugging this into the above)

\[ \frac{x_{i+1} - x_i}{h} = -\frac{1}{\mu}\nabla V(x_i), \]or, after rearranging (and setting \(\mu=1\), since we can control \(h\) however we like, say by defining \(h := \frac{h}{\mu}\))

\[ x_{i+1} = x_i - h\nabla V(x_i). \]In other words, gradient descent corresponds to the discretization of Newton's equations in the *overdamped* limit (i.e. in the limit of small mass and large friction).
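The discretized update is a one-liner; here's a minimal sketch on a quadratic bowl (the function and step size are my own arbitrary choices, not anything from the trajectory problem):

```python
import numpy as np

def grad_descent(grad, x0, h=0.1, iters=500):
    """Iterates x_{i+1} = x_i - h * grad(x_i), the overdamped update above."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - h * grad(x)
    return x

# quadratic bowl V(x) = 0.5 ||x||^2, so grad V(x) = x and the minimum is at 0
x_min = grad_descent(lambda x: x, [3.0, -2.0])
print(np.round(x_min, 8))
```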

This method is great because (a) we know it converges with probability 1 (as was relatively recently proven here) for arbitrary, somewhat nice functions and (b) because it *works*. That being said, it's slow; for example, in the previous post, we saw that it converged after 5000 iterations (which, to be fair, takes about 20 seconds on my machine, but still).

A simple improvement (where we don't throw \(m\to 0\)) yields a significant speed up! Of course, at the cost of having to deal with more hyperparameters, but that's okay: we're big kids now, and we can deal with more than one hyperparameter in our problems.

The next idea is to, instead of taking \(m\to 0\), just write out the full discretization scheme. To make our lives easier, we define \(v(t) \equiv \dot x(t)\) to be our velocity; this gives us a simple rewriting of the form:

\[ \begin{align} m\dot v &= -\nabla V(x) - \mu v\\ \dot x(t) &= v(t) \end{align} \]discretizing the second equation with some step-size \(h'\) (as above) we get

\[ x_{t+1} = x_t + h'v_{t+1} \]where the former equation is, when discretized with some step size \(h\)

\[ m\frac{v_{t+1} - v_t}{h} = -\nabla V(x_t) - \mu v_t \]or after rearranging, and defining \(\gamma \equiv \frac{h}{m}\) (which we can make as small as we'd like)

\[ \begin{align} v_{t+1} &= -\gamma \nabla V(x_t) + (1-\mu \gamma) v_t\\ x_{t+1} &= x_t + h'v_{t+1} \end{align} \]usually we take \(h' = 1\), and, to prevent \(v_t\) from having weird behaviour, we require that \(1-\mu\gamma > 0\), i.e. that \(\gamma < \frac{1}{\mu}\).^{[3]} If we call \(\beta \equiv 1 - \mu\gamma\) and therefore have that \(0 < \beta < 1\) then we obtain the classical momentum form of gradient descent

\[ \begin{align} v_{t+1} &= \beta v_t - \gamma\nabla V(x_t)\\ x_{t+1} &= x_t + v_{t+1}, \end{align} \]

which is what we needed! Well... close to what we needed, really.
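Here's a minimal sketch of that momentum update in Python (the quadratic test function and the values of \(\gamma\) and \(\beta\) are illustrative choices of mine, not tuned):

```python
import numpy as np

def momentum_descent(grad_V, x0, gamma=0.1, beta=0.9, n_iters=500):
    """v_{t+1} = beta * v_t - gamma * grad_V(x_t);  x_{t+1} = x_t + v_{t+1}."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_iters):
        v = beta * v - gamma * grad_V(x)
        x = x + v
    return x

# Same toy potential as before: V(x) = ||x||^2 / 2, minimum at 0.
x_star = momentum_descent(lambda x: x, x0=[3.0, -2.0])
```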

Anyways, just to give some perspective on the speed up: using momentum, the optimization problem took around 600 iterations to converge, more than 8 times less than the original given above. I'll give a picture of this soon, but I'm missing one more slight detail.

Imagine we want to optimize some function \(\ell(\cdot)\) that is, in general, extremely hard to solve. If we're lucky, we may be able to do the next best thing: take a series of functions parametrized by, say, \(C\), such that \(\ell_C(\cdot) \to \ell(\cdot)\) as \(C\to\infty\),^{[4]}*and* where the problem is simple to solve for \(C_{k+1}\), given the solution for \(C_k\).

Of course, given this and the above we can already solve the problem: we begin with some small \(C_0\) and then, after converging for \(\ell_{C_0}(\cdot)\) we then continue to \(\ell_{C_1}(\cdot)\), after converging to that, we then continue to solve for \(\ell_{C_2}(\cdot)\), etc., until we reach some desired tolerance on the given result.

Or... (of course, I'm writing this for a reason), we could do something fancy using the previous scheme:

Every time we update our variable, we also increase \(k\) such that both the problem sequence and the final solution converge at the same time. It is, of course, totally not obvious that this works (though with some decent choice of schedule, one could imagine it should); the video below shows this idea in action using both momentum and this particular choice of cooling scheme (note the number of iterations is much lower relative to the previous attempt's 5000, but also note that, while the scheme converged in the norm—that is, the variables were updated very little—it didn't actually converge to an optimal solution, but it was pretty close!).
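As a sketch of the idea, here's a toy one-dimensional version in Python, where each iteration takes one gradient step on the \(C_k\)-smoothed loss while \(C_k\) increases; the particular loss (one unit 'obstacle' plus a quadratic anchor at \(x=2\)), the schedule, and the step size are all my own illustrative choices:

```python
import math

def phi(z):
    """Reversed logistic, clipped so exp never overflows."""
    z = max(min(z, 50.0), -50.0)
    return 1.0 / (1.0 + math.exp(z))

def dphi(z):
    # phi'(z) = -phi(z) * (1 - phi(z))
    p = phi(z)
    return -p * (1.0 - p)

def annealed_descent(x0, schedule, h=0.05):
    """One gradient step per iteration on loss(x, C) while C ramps up."""
    x = float(x0)
    for C in schedule:
        # loss(x, C) = phi(C * (x^2 - 1)) + 0.1 * (x - 2)^2
        grad = dphi(C * (x * x - 1.0)) * C * 2.0 * x + 0.2 * (x - 2.0)
        x -= h * grad
    return x

# Ramp C from 1 to 100 over 500 iterations; start just outside the obstacle.
x_star = annealed_descent(1.5, [1.0 + 99.0 * k / 499.0 for k in range(500)])
```

The iterate settles near \(x=2\), outside the unit obstacle, while the early (small-\(C\)) iterations still see a useful gradient.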

I'd highly recommend looking at the code in order to get a better understanding of how all this is implemented and the dirty deets.

Anyways, optimizing the original likelihood presented in the first post (and in the first part of *this* post) using momentum and the above cooling schedule yields the following nice little video:

<video controls> <source src="/images/path-optimization-2/path_optimization_2.mp4" type="video/mp4"> </video>

As before, this code (with more details and implementation) can be found in the StanfordAIR Github repo.

[1] | This is, indeed, a technical term, but it's also quite suggestive of what really happens. |

[2] | Of course, there are many reasons why we'll want momentum, but those will come soon. |

[3] | Consider \(V(x) = 0\), with some initial condition, \(v_0 > 0\), say, then we'll have
\[ v_t = -k v_{t-1} \]
for some \(k = \mu\gamma - 1>0\). Solving this yields \(v_{t} = (-1)^tk^t v_{0}\). This is weird, because it means that our velocity will change directions every iteration even though there's no potential! This is definitely not expected (nor desirable) behaviour. |

[4] | In some sense. Say: in the square error, or something of the like. This can be made entirely rigorous, but I choose not to do it here since it's not terribly essential. |

That being said, if anyone has any papers that I should *definitely* read, please do send them over to my Twitter (below) or email, etc.; whatever floats your boat. I feel like I have a very limited view of the current state of the field, so I'd always love to learn more! That being said, a cursory search through Google Scholar isn't as productive as I would've thought.

Anyways, let's get to something more interesting. I believe I'll be splitting this post up into a small set (say, 3, though this may change) of posts explaining individual parts and more prickly details of the algorithm, but for now I'll just share the big idea and dive into the last part (which I argue is the hardest case).

Essentially the problem will be broken down into three basic steps (and a fourth "looping" step):

1. Discretize the space and goals into a graph problem which is guaranteed (a) to be *damn fast to solve* and (b) to always give a feasible result (minus a curvature constraint, which will come in later).

2. Turn the resulting path through the graph into an ordered set of points \(x_i \in \mathbb{R}^2\) (or \(\mathbb{R}^3\), depending on what problem needs to be solved) through actual Euclidean space.

3. Perform continuous optimization starting at this resulting path in order to meet curvature constraints and add some 'finishing touches' (this will be formalized in a second, don't worry).

4. Do \((3)\) for moving objects, for a while, as \((1)\) and \((2)\) are solved again, simultaneously.

In this post, I'll mostly focus on step \((3)\), which is actually all you need to truly optimize over a path (along with some cute other heuristics), though steps \((1)\) and \((2)\) are also really just fast heuristics so we don't get stuck in crappy minima that would take us through the middle of an obstacle. I'll show how this can happen in non-obvious ways which is kinda fun for the first few times and mostly infuriating for the rest of the time (which is why we end up going through \((1)\) and \((2)\) in the end!).

Perhaps the main idea of this step is that we can optimize over some function (which isn't quite a hard-wall constraint) and then slowly tune a parameter until it becomes a better and better approximation of a hard wall; for this example I've chosen the (reversed) logistic function

\[ \phi(x) = \frac{1}{1+e^{x}} \]such that two things happen: one, that \(\phi(x) \to 0\) as \(x\to \infty\) and \(\phi(x) \to 1\) as \(x\to -\infty\), and, two, that \(\phi(Cx)\) approximates a hard wall as \(C\to \infty\). Below is \(\phi(Cx)\) plotted for a few different values of \(C\):

<img src="/images/path-optimization-1/phi_curvature.png" class="plot"> *Barrier functions for varying curvatures \(C\).*
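Numerically, the hardening is easy to see; this small check (values of \(C\) chosen arbitrarily) evaluates \(\phi(Cx)\) just inside and outside the wall:

```python
import math

def phi(z):
    """Reversed logistic; clipped so exp never overflows."""
    z = max(min(z, 50.0), -50.0)
    return 1.0 / (1.0 + math.exp(z))

# As C grows, phi(C*x) approaches 1 for x < 0 and 0 for x > 0.
for C in (1.0, 10.0, 100.0):
    print(f"C={C:5.0f}: phi(-0.5*C)={phi(-0.5 * C):.4f}, phi(0.5*C)={phi(0.5 * C):.4f}")
```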

The idea is that the smooth problem should be easy to solve and we can get consistently better approximations by starting at the easy problem and solving a sequence of problems which, in the limit, give the desired path.

More generally speaking, let the obstacles be centered at some set of points \(\{c_j\}\), each with some radius \(R_j\), then a single constraint corresponds to the barrier of curvature \(C\) given by (where the object is at position \(x\))

\[ \phi\left(C\left(\frac{\lVert x - c_j \lVert_2^2}{R_j^2} - 1\right)\right) \]which, if we assume that our path is characterized by an ordered set of points \(\{x_i\}\), gives our complete energy function to be

\[ \mathcal{L}(x; c, R, C) = \sum_{ij}\phi\left(C\left(\frac{\lVert x_i - c_j \lVert_2^2}{R_j^2} - 1\right)\right) \]which is really just a fancy way of writing "each discretized point in my path should be outside of an obstacle." This is *close* to what we want, but it's not quite there yet: we aren't penalizing for being arbitrarily far away from other points—that is, if we just put all of our \(\{x_i\}\) at infinity, we now have zero penalty!

Of course, that's a pretty stupid path that no drone can take (especially if we're constrained to be in some particular region, which, in this case, we are), so we do the next straightforward thing: we also penalize any point being far away from its adjacent points. E.g. we add a penalty term of the form \(\eta\lVert x_i - x_{i+1}\lVert_2^2\) for \(\eta>0\).

In this case, our complete energy function then looks like

\[ \mathcal{L}(x; c, R, C) = \sum_{i}\left[\sum_j\phi\left(C\left(\frac{\lVert x_i - c_j \lVert_2^2}{R_j^2} - 1\right)\right) + \eta \lVert x_i - x_{i+1}\lVert_2^2\right] \]with a 'tunable' parameter \(\eta\), and constraint wall 'hardness' \(C\) which we send to infinity as we solve a sequence of problems. That is, let \(\{C_k\}\) be a sequence such that \(C_k\to \infty\) then we solve the sequence of problems

\[ x^{(k)} = \operatorname*{argmin}_x \mathcal{L}(x; c, R, C_k) \]and take the trajectory

\[ x^* = \lim_{k\to\infty} x^{(k)} \]in the limit. Why do we do this? Because the derivative of \(\mathcal{L}\) vanishes as \(C\to\infty\) for the hard constraints. This can be seen in the picture above, by looking at the left side; as \(C\) becomes large, the function becomes essentially flat both when \(x<0\) and when \(x>0\). This is generally bad since, if we were to optimize directly with some very large \(C\), starting from a path that goes through the interior of an obstacle, we would be near a point where the derivative nearly vanishes even though we're inside the obstacle!
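To make the objective concrete, here's a small vectorized implementation of \(\mathcal{L}\) in Python (the three-point path and single obstacle are a hypothetical setup of mine):

```python
import numpy as np

def phi(z):
    return 1.0 / (1.0 + np.exp(np.clip(z, -50.0, 50.0)))

def path_energy(x, centers, radii, C, eta):
    """L(x; c, R, C) = sum_ij phi(C(||x_i - c_j||^2 / R_j^2 - 1))
                       + eta * sum_i ||x_i - x_{i+1}||^2."""
    # Squared distance from every path point to every obstacle center.
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    obstacle = phi(C * (d2 / radii[None, :] ** 2 - 1.0)).sum()
    smooth = eta * ((x[1:] - x[:-1]) ** 2).sum()
    return obstacle + smooth

# A three-point path skirting one unit-radius obstacle at the origin.
x = np.array([[-2.0, 0.0], [0.0, 1.5], [2.0, 0.0]])
centers = np.array([[0.0, 0.0]])
radii = np.array([1.0])
E = path_energy(x, centers, radii, C=10.0, eta=0.1)
```

Since every point lies outside the obstacle, the barrier contributes almost nothing here and the smoothness term dominates.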

This is totally infeasible for our problem and we cannot sidestep this issue in an obvious way using general optimization tools. So we're forced to do the Next Best Thing™, which is to perform this cooling schedule idea while optimizing over the objective.^{[1]}

Anyways, optimizing this function somewhat successfully with some decent cooling schedule (which is the subject of the next post) yields a cute movie that looks like the following

<video controls> <source src="/images/path-optimization-1/path_optimization.mp4" type="video/mp4"> </video>

Don't be fooled, though: there's plenty of little experiments that didn't work out while making this. Robustness is a huge reason why optimizing just this objective would take way too long and, hence, why we require the heuristics mentioned above (and which I'll soon discuss!).

A general overview of the code (with more details and implementation) can be found in the StanfordAIR Github repo.

[1] | As given before, we can create feasible trajectories which do not have this problem by discretization methods—this helps out quite a bit since, for complicated trajectories where a lot of the initial path intersects obstacles, most of the time is spent on either (a) making a good cooling schedule for \(C\) or (b) escaping the minima which include local obstacles. I'll discuss these methods in a later post. |

*If you're interested in looking at the results first, I'd recommend skipping the following section and going immediately to the next one, which shows the application.*

So, let's dive right in!

The usual least-squares many of us have heard of is a problem of the form

\[ \min_x \,\,\lVert Ax - b \lVert^2 \]

where I define \(\lVert y\lVert^2 \equiv \sum_i y_i^2 = y^Ty\) to be the usual Euclidean norm (and \(y^T\) denotes the transpose of \(y\)). This problem has a unique solution provided that \(A\) is full-rank (i.e. has independent columns), and therefore that \(A^TA\) is invertible.^sq-invertible This is true since the problem above is convex (e.g. any local minimum, if it exists, corresponds to the global minimum^convex-global-min), coercive (the function diverges to infinity, and therefore *has* a local minimum) and differentiable, such that setting its gradient to zero yields

\[ \nabla_x \lVert Ax - b \lVert^2 = 2A^T(Ax - b) = 0, \]

or, after rearranging the above equation,

\[ A^TAx = A^Tb. \]This equation is called the *normal equation*, which has a unique solution for \(x\) since we said \(A^TA\) is invertible. In other words, we can write down the (surprisingly, less useful) closed-form solution

\[ x = (A^TA)^{-1}A^Tb. \]
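A minimal numerical sketch (the random \(A\) and \(b\) are my own): solve the normal equations \(A^TAx = A^Tb\) directly and compare against `np.linalg.lstsq`, which avoids forming \(A^TA\) and is generally better conditioned (one reason the closed form is less useful in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))  # tall, full-rank with probability 1
b = rng.standard_normal(50)

# Solve the normal equations A^T A x = A^T b directly...
x_normal = np.linalg.solve(A.T @ A, A.T @ b)
# ...and via a solver that never forms A^T A explicitly.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
```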

A simple example of direct least squares can be found on the previous post, but that's nowhere near as interesting as an *actual* example, using some images, presented below. First, to show the presented example is possible, I should note that this formalism can be immediately extended to cover a (multi-)objective problem of the form, for \(\lambda_i > 0\),

\[ \min_x \,\, \sum_i \lambda_i \lVert A_i x - b_i \lVert^2, \]

by noting (say, with two objectives, though the idea extends to any number) that we can pull the \(\lambda_i\) inside the norm, and observing that

\[ \lVert a\lVert^2 + \lVert b \lVert^2 = \left \lVert \begin{bmatrix} a\\ b \end{bmatrix} \right\lVert^2. \]So we can rewrite the above multi-objective problem as

\[ \lambda_1\lVert A_1x - b_1 \lVert^2 + \lambda_2\lVert A_2x - b_2 \lVert^2 = \left\lVert \begin{bmatrix} \sqrt{\lambda_1} A_1\\ \sqrt{\lambda_2} A_2 \end{bmatrix} x - \begin{bmatrix} \sqrt{\lambda_1}b_1\\ \sqrt{\lambda_2}b_2 \end{bmatrix}\right\lVert^2. \]Here, the new matrices above are defined as the 'stacked' (appended) matrix of \(A_1, A_2\) and the 'stacked' vector of \(b_1, b_2\). Or, defining

\[ \bar A \equiv \begin{bmatrix} \sqrt{\lambda_1} A_1\\ \sqrt{\lambda_2} A_2 \end{bmatrix} \]and

\[ \bar b \equiv \begin{bmatrix} \sqrt{\lambda_1} b_1\\ \sqrt{\lambda_2} b_2 \end{bmatrix}, \]we have the equivalent problem

\[ \min_x \,\, \lVert \bar A x - \bar b\lVert^2 \]which we can solve by the same means as before.
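In Python, the stacking trick is a few lines (the random data here is mine, purely for illustration):

```python
import numpy as np

def multi_objective_ls(As, bs, lams):
    """Minimize sum_i lam_i ||A_i x - b_i||^2 by stacking sqrt(lam_i)-scaled blocks."""
    A_bar = np.vstack([np.sqrt(l) * A for l, A in zip(lams, As)])
    b_bar = np.concatenate([np.sqrt(l) * b for l, b in zip(lams, bs)])
    x, *_ = np.linalg.lstsq(A_bar, b_bar, rcond=None)
    return x

rng = np.random.default_rng(1)
A1, b1 = rng.standard_normal((30, 4)), rng.standard_normal(30)
A2, b2 = rng.standard_normal((20, 4)), rng.standard_normal(20)
x = multi_objective_ls([A1, A2], [b1, b2], [1.0, 5.0])
```

At the minimizer, the weighted sum of the two normal-equation residuals vanishes, which is an easy optimality check.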

This 'extension' now allows us to solve a large number of problems (even equality-constrained ones! For example, say \(\lambda_i\) corresponds to an equality constraint, then we can send \(\lambda_i \to \infty\), which, if possible, sends that particular term to zero^eq-constraint), including the image reconstruction problem that will be presented below. Yes, there are better methods, but I can't think of many that can be written in about 4 lines of Python with only a linear-algebra library (not counting loading and saving, of course 😉).

Let's say we are given a blurred image, represented by some vector \(y\) with, say, a gaussian blur operator given by \(G\) (which can be represented as a matrix). Usually, we'd want to minimize a problem of the form

\[ \min_x \,\,\lVert Gx - y \lVert^2 \]where \(x\) is the reconstructed image. In other words, we want the image \(x\) such that applying a gaussian blur \(G\) to \(x\) yields the closest possible image to \(y\). E.g. we really want something of the form

\[ Gx \approx y. \]Writing this out is a bit of a pain, but it's made a bit easier by noting that convolution with a 2D gaussian kernel is separable into two 1D convolutions (e.g. convolve the image with the up-down filter, and do the same with the left-right) and by use of the Kronecker product to write out the individual matrices.^kronecker-conv The final \(G\) is therefore the product of each of the convolutions. Just to show the comparison, here's the initial image, taken from Wikipedia

<img src="/images/constrained-ls-intro/initial_image.png" class="insert"> *Original greyscale image*

and here's the image, blurred with a 2D gaussian kernel of size 5, with \(\sigma = 3\)

<img src="/images/constrained-ls-intro/blurred_image.png" class="insert"> *Blurred greyscale image. The vignetting comes from edge effects.*

The kernel, for context, looks like:

<img src="/images/constrained-ls-intro/gaussian_kernel.png" class="insert"> *2D Gaussian Kernel with \(N=5, \sigma=3\)*
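A sketch of how such a \(G\) can be assembled via Kronecker products (the tiny image size and zero-padded 'same' convolution here are my own simplifications; edge handling like this is exactly where the vignetting comes from):

```python
import numpy as np

def gaussian_kernel_1d(n=5, sigma=3.0):
    t = np.arange(n) - n // 2
    k = np.exp(-t**2 / (2.0 * sigma**2))
    return k / k.sum()

def conv_matrix_1d(kernel, m):
    """m x m matrix applying a zero-padded 'same' 1D convolution."""
    n = len(kernel)
    T = np.zeros((m, m))
    for i in range(m):
        for j, kj in enumerate(kernel):
            col = i + j - n // 2
            if 0 <= col < m:
                T[i, col] = kj
    return T

m = 8
T = conv_matrix_1d(gaussian_kernel_1d(), m)
# Separable 2D blur on a row-major m*m image: blur one axis, then the other.
G = np.kron(T, np.eye(m)) @ np.kron(np.eye(m), T)
```

By the mixed-product property this product collapses to \(T \otimes T\), and both factors are sparse.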

Solving the problem, as given before, yields the final (almost identical) image:

<img src="/images/constrained-ls-intro/reconstructed_image.png" class="insert"> *The magical reconstructed image!*

Which was nothing but solving a simple least squares problem (as we saw above)!

Now, you might say, "why are we going through all of this trouble to write this problem as a least-squares problem, when we can just take the FFT of the image and the gaussian and divide the former by the latter? Isn't convolution just multiplication in the Fourier domain?"

And I would usually agree!

Except for one problem: while we may *know* the gaussian blurring operator on artificial images that *we* actively blur, the blurring operator that we provide for real images may not be fully representative of what's really happening! By that I mean, if the real blurring operator is given by \(G^\text{real}\), it could be that our guess \(G\) is far away from \(G^\text{real}\), perhaps because of some non-linear effects, or random noise, or whatever.

That is, we know what photos, in general, look like: they're usually pretty smooth and have relatively few edges. In other words, the variations and displacements aren't large almost everywhere in most images. This is where the multi-objective form of least-squares comes in handy—we can add a secondary (or third, etc) objective that allows us to specify how smooth the actual image should be!

How do we do this, you ask? Well, let's consider the gradient at every point. If the gradient is large, then we've met an edge, since there's a large color variation between one pixel and its neighbour; similarly, if the gradient is small at that point, the image is relatively smooth at that point.

So, how about specifying that the sum of the norms of the gradients at every point be small?^heat-diffusion That is, we want the gradients to *all* be relatively small (minus those at edges, of course!), with some parameter that we can tune. In other words, let \(D_x\) be the difference matrix between pixel \((i,j)\) and pixel \((i+1,j)\) (e.g. if our image is \(X\) then \((D_x X)_{ij} = X_{i+1,j} - X_{ij}\)), and, similarly, let \(D_y\) be the difference matrix between pixel \((i, j)\) and \((i,j+1)\).^derivative-mat Then our final objective is of the form

\[ \min_x \,\, \lVert Gx - y \lVert^2 + \lambda\left(\lVert D_x x \lVert^2 + \lVert D_y x \lVert^2\right), \]

where \(\lambda \ge 0\) is our 'smoothness' parameter. Note that, if we send \(\lambda \to \infty\) then we really care that our image is 'infinitely smooth' (what would that look like?^smooth-image), while if we send it to zero, we care that the reconstruction from the (possibly not great) approximation of \(G^\text{real}\) is really good. Now, let's compare the two methods with a slightly corrupted image:

<img src="/images/constrained-ls-intro/corrupted_blurred_image.png" class="insert"> *The corrupted, blurred image we feed into the algorithm*

<img src="/images/constrained-ls-intro/initial_image.png" class="insert"> *Original greyscale image (again, for comparison)*

<img src="/images/constrained-ls-intro/smoothed_corrupted_reconstructed_image_l=1e-07.png" class="insert"> *Reconstruction with \(\lambda = 10^{-7}\)*

<img src="/images/constrained-ls-intro/corrupted_reconstructed_image.png" class="insert"> *Reconstruction with original method*

Though the regularized reconstruction has slightly larger grains, note that, unlike the original, the contrast isn't as heavily lost and the edges, etc., are quite a bit sharper.

We can also toy a bit with the parameter, to get some intuition as to what all happens:

<img src="/images/constrained-ls-intro/smoothed_corrupted_reconstructed_image_l=0.001.png" class="insert"> *Reconstruction with \(\lambda = 10^{-3}\)*

<img src="/images/constrained-ls-intro/smoothed_corrupted_reconstructed_image_l=1e-05.png" class="insert"> *Reconstruction with \(\lambda = 10^{-5}\)*

<img src="/images/constrained-ls-intro/smoothed_corrupted_reconstructed_image_l=1e-10.png" class="insert"> *Reconstruction with \(\lambda = 10^{-10}\)*

Of course, as we make \(\lambda\) large, note that the image becomes quite blurry (e.g. 'smoother'), and as we send \(\lambda\) to be very small, we end up with the same solution as the original problem, since we're saying that we care very little about the smoothness and much more about the reconstruction approximation.
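For completeness, here's a sketch of how the smoothed reconstruction can be computed via its normal equations (the tiny \(3\times 3\) 'image', identity blur, and forward-difference operators are all stand-ins of mine; the real problem just swaps in the actual \(G\), \(D_x\), \(D_y\)):

```python
import numpy as np

def smoothed_reconstruction(G, Dx, Dy, y, lam):
    """Minimize ||G x - y||^2 + lam (||Dx x||^2 + ||Dy x||^2)."""
    lhs = G.T @ G + lam * (Dx.T @ Dx + Dy.T @ Dy)
    return np.linalg.solve(lhs, G.T @ y)

m = 3
L = np.eye(m - 1, m) - np.eye(m - 1, m, k=1)  # forward differences
Dx, Dy = np.kron(np.eye(m), L), np.kron(L, np.eye(m))
G = np.eye(m * m)  # identity 'blur', just to show the mechanics
rng = np.random.default_rng(2)
y = np.ones(m * m) + 0.1 * rng.standard_normal(m * m)
x_hat = smoothed_reconstruction(G, Dx, Dy, y, lam=10.0)
```

With a large \(\lambda\) the solution is pulled toward a constant image while its mean is preserved exactly, since the difference operators annihilate constants.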

To that end, one could think of many more ways of characterizing a 'natural' image (say, if we know what some colors should look like, what is our usual contrast, etc.), all of which will yield successively better results, but I'll leave with saying that LS, though relatively simple, is quite a powerful method for many cases. In particular, I'll cover fully-constrained LS (in a more theoretical post) in the future, but with an application to path-planning.

Hopefully this was enough to convince you that even simple optimization methods are pretty useful! But if I didn't do my job, maybe you'll have to read some future posts. ;)

If you'd like, the complete code for this post can be found here.

A nice picture usually helps with this:

Each of the hyperplanes (which are taken at the open, red circles along the curve; the hyperplanes themselves denoted by gray lines) always lies below the graph of the function (the blue parabola). We can write this as

\[ f(y)\ge (y-x)^T(\nabla f(x)) + f(x) \]

for all \(x, y\). This is usually taken to be the *definition* of a convex function, so we'll take it as such here. Now, if the point \(x^0\) is a local minimum, we must have \(\nabla f(x^0) = 0\); this means that

\[ f(y) \ge (y-x^0)^T(\underbrace{\nabla f(x^0)}_{=0}) + f(x^0) = (y-x^0)^T0 + f(x^0) = f(x^0), \]

for any \(y\).

In other words, that

\[ f(y) \ge f(x^0), \]for any \(y\). But this is the definition of a global minimum since the point \(f(x^0)\) is less than any other value the function takes on! So we've proved the claim that any local minimum (in fact, more strongly, that any point with vanishing derivative) is immediately a global minimum for a convex function. This is what makes convex functions so nice!

\[ G = (T\otimes I_{m+n})(I_m \otimes T) \]

which is much simpler to compute than the horrible initial expression dealing with indices. Additionally, these expressions are sparse (e.g. the first is block-diagonal), so we can exploit that to save on both memory and processing time. For more info, I'd recommend looking at the code for this entry.

\[ L=\begin{bmatrix} 1 & -1 & 0 & 0 & \dots\\ 0 & 1 & -1 & 0 & \dots\\ 0 & 0 & 1 & -1 & \dots\\ \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix} \]

this is relatively straightforward to form using an off-diagonal matrix. Note also that this matrix is quite sparse, which saves us from very large computation. Now, we want to apply this to each row (column) of the 2D image which is in row-major form. So, we follow the same idea as before and pre (post) multiply by the identity matrix. That is, for an \(m\times m\) image

\[ D_x = I_m\otimes L \]and

\[ D_y = L\otimes I_m. \]

For context, I'm currently leading a team on path planning for fixed-wing UAVs (I still don't really know who put me in charge of this stuff, or *why* for that matter; overall, it seems pretty terrifying for them, but kinda fun for me), and I wondered why I hadn't actually seen least squares in many papers on fixed-wing control. I still haven't gotten an answer to the question, to be honest, but I did waste some potentially productive time showing that PID \(\subset\) LS. Some quick definitions: let \(u(t)\) be our control input and allow \(\varepsilon(t)\) to be the error of the function (that is, \(\varepsilon(t) = x(t) - \hat x(t)\) where \(x(t)\) is the desired position and \(\hat x(t)\) is the current position); then a PID controller is defined as

\[ u(t) = K_p\varepsilon(t) + K_i\int d\tau\, \varepsilon(\tau) + K_d\frac{\partial\varepsilon(t)}{\partial t}, \]

where each of the \(K_{(\cdot)}\) variables are a gain or proportionality constant. Say \(K_p\) is the proportional constant (i.e. how much of \(u\) is proportional to the current error), \(K_i\) is the integral proportionality constant (i.e. how much of \(u\) is proportional to the integral of the error), and \(K_d\) is the derivative constant (i.e. ditto). For a more thorough explanation for what each of these means intuitively, see the PID wikipedia page.

Anyways, I'll likely make a separate (more introductory) post to least squares but, for now, I define an LS problem to be an optimization problem of the form, for arbitrary but given \(A, b, \lambda_j>0, C, d\)

\[ \begin{aligned} & \underset{u}{\text{minimize}} & & \sum_j \lambda_j \lVert A_j u - b_j\lVert_2^2 \\ & \text{subject to} & & Cu = d \end{aligned} \]where the \(\lVert \cdot \lVert_2^2\) norm is the usual \(\ell_2\)-norm (i.e. \(\lVert x \lVert_2^2 = \sum_i x_i^2\)). It's notable that this problem has an analytical solution (not that you'd necessarily *want* the analytical solution for most big-enough scenarios) and is extremely well-behaved for most optimization methods.^{[2]} Now, consider the following objective function with trivial equality constraints (e.g. \(0=0\), for convenience, by setting \(C = (0,0,…,0)^T,\, d = 0\)) and \(K\) being some proportionality constant (I'll make the connection to the original \(K_{(\cdot)}\) variables above, soon):

\[ E(u) = \lambda_p \left\lVert u - K\varepsilon(t)\right\lVert_2^2 + \lambda_i \left\lVert u - K\int d\tau\, \varepsilon(\tau)\right\lVert_2^2 + \lambda_d \left\lVert u - K\frac{\partial\varepsilon(t)}{\partial t}\right\lVert_2^2. \]

Minimizing this function by setting its gradient to zero (this is necessary and sufficient by differentiability, convexity, and coerciveness [that is, \(E(u) \to \infty\), whenever \(\lVert u\lVert \to \infty\)]) gives the solution^{[3]}

\[ \lambda_p\left(u - K\varepsilon(t)\right) + \lambda_i\left(u - K\int d\tau\, \varepsilon(\tau)\right) + \lambda_d\left(u - K\frac{\partial\varepsilon(t)}{\partial t}\right) = 0, \]

or, after rearranging

\[ u = \frac{K}{\lambda_p + \lambda_i + \lambda_d}\left(\lambda_p \varepsilon(t) + \lambda_i\int d\tau\, \varepsilon(\tau) + \lambda_d \frac{\partial\varepsilon(t)}{\partial t}\right), \]which allows the following correspondence between the original PID and the LS problem to be

\[ \begin{aligned} K &= K_p + K_i + K_d\\ \lambda_p &= \frac{K_p}{K}\\ \lambda_i &= \frac{K_i}{K}\\ \lambda_d &= \frac{K_d}{K}. \end{aligned} \]So, now we've given the condition we wanted and we're done!
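As a quick numerical check of this correspondence (the gain values and error samples are picked arbitrarily), the LS minimizer reproduces the PID output exactly:

```python
def pid(Kp, Ki, Kd, err, int_err, derr):
    """Classical PID: u = Kp*e + Ki*int(e) + Kd*de/dt."""
    return Kp * err + Ki * int_err + Kd * derr

def ls_minimizer(Kp, Ki, Kd, err, int_err, derr):
    """Minimizer of E(u) with K = Kp + Ki + Kd and lam_x = K_x / K."""
    K = Kp + Ki + Kd
    lp, li, ld = Kp / K, Ki / K, Kd / K
    return K * (lp * err + li * int_err + ld * derr) / (lp + li + ld)

u_pid = pid(2.0, 0.5, 0.1, err=1.3, int_err=-0.4, derr=0.2)
u_ls = ls_minimizer(2.0, 0.5, 0.1, err=1.3, int_err=-0.4, derr=0.2)
```

Since \(\lambda_p + \lambda_i + \lambda_d = 1\) under this correspondence, the two expressions agree term by term.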

Anyways, you may ask, why is this useful? I guess it kind of extends the framework to add constraints from your control surface, or secondary objectives. To be completely honest, though? I have no idea.

[1] | I mostly know people who do hardware work, etc. on UAVs, so I don't really have a representative sample of control people. |

[2] | PolitiFact: Mostly true. I mean the usual cases (e.g. first-order methods, second-order methods, or conjugate gradient/quasi-newton methods). It's horribly behaved in conic program (SOCP) solvers. |

[3] | There's an immediate generalization here: any control of the form \(\sum_i \gamma_i\left(u - C_i\right), \gamma_i>0\) can be immediately written as the minimizer to an energy function \(E(u) = \sum_i \eta_i\lVert u - \xi C_i\lVert^2_2\). We can actually go further and note there's yet another generalization to any control of the form \(\sum_i \left(S_iu - C_i\right)\), where each \(S_i\) is symmetric and (strictly) positive definite. This is true as each \(S_i\) has an inverse and a 'square root' matrix (e.g. let, \(S\) be some positive-definite matrix. We know \(SV = V\Lambda\) for \(V^TV = VV^T = I\) and diagonal \(\Lambda > 0\), thus \(S^{1/2}=V\Lambda^{1/2}V^T\)), such that the energy function is written in terms of these. Though it's somewhat enlightening (I guess), I leave the derivation as an exercise for the reader. |
