Longest Path Search

Comparisons and information theory: why the worst-case complexity of sorting is n log n

Posted 2020-06-19

In this post, we'll talk a bit about the usual proofs about the worst-case query complexity of sorting (at least, in the deterministic case) and then use a beautiful (and surprisingly simple!) tool from computational lower bounds to give a very general argument about the construction

The usual approach

There are many notes and posts talking about the fact that sorting will, in the worst case, always take Ω(nlogn)\Omega(n \log n) comparisons; equivalently, in the worst case, the number of comparisons is about a constant factor away from nlognn \log n when nn is large. In many cases, the proof presented depends on the fact that the sorting algorithm is deterministic and goes a little like this (see here for a common case):


Let CkC_k be the kkth step of a deterministic sorting algorithm (this can be represented, for example, as a tuple containing what the next comparison should be) with input LKnL \in K^n where LL is a list of nn comparable elements. (For example, LL can be a list of real numbers, in which case K=RK = \mathbb{R}.)

By the definition of a deterministic algorithm, CkC_k depends only on the past k1k-1 comparisons; i.e., Ck(Ck1,Ck2,,C1)C_k(C_{k-1}, C_{k-2}, \dots, C_1). (I am slightly overloading notation here, of course, but the meaning should be clear.) This means that we can view the behavior of the algorithm as a tree, where CkC_k is a child node of Ck1C_{k-1} which itself is a child node of Ck2C_{k-2}, etc. Additionally, the tree is binary since the output of a comparison is only one of two possibilities (if Ck=(a,b)C_k = (a, b), then either aba\le b or bab \le a).

Finally, let the leaf nodes of this tree be the list of indices (say, pp, where each entry is an index, pi{1,,n}p_i \in \{1, \dots, n\} for i=1,,ni=1, \dots, n) such that the list permuted at these indices is sorted, Lp1Lp2,LpnL_{p_1} \le L_{p_2} \le \dots, \le L_{p_n}. Note that the number of nodes needed to get from the root node to a given leaf node (or permutation) is exactly the number of comparisons that the algorithm makes before it returns a specific permutation. If we can show that the height (the length of the longest path from root to leaf) of the tree is always larger than about nlognn \log n, then we've shown that this algorithm must take at least nlognn \log n steps.

Bound on the height

The idea for the bound is pretty simple: since this algorithm is a sorting algorithm and it can receive any unsorted list, then each of the n!n! possible permutations must be a leaf node of the tree. (Why?) Additionally, the maximum number of leaves for a binary tree of height hh is 2h2^h, which, in turn, implies that we must have 2hn!2^h \ge n!. Taking the log of both sides shows that:

h=log(2h)log(n!)=Ω(nlogn), h = \log(2^h) \ge \log(n!) = \Omega(n\log n),

which is exactly what we wanted to show. (The second equality is shown, for example, in the original reference for this proof; see page 2.)

Some takeaways from this proof

To prove this statement we only really used a few things: (a) the algorithm has to decide between n!n! things, and (b) at each query, it only receives a "yes" or a "no" (as we will soon make rigorous, it only gains 1 bit of information from each query). The rest of the proof simply sets up scaffolding for the remaining parts, most of which is really somewhat orthogonal to our intuition. The point is: look, we have n!n! things we have to decide on and every time we ask a question, we cut down our list of possible true answers by about half. How many times do we need to cut down our list to be able to have exactly one possible answer? (Of course, as we showed before, this should be log2(n!)\log_2(n!).)

Now, a good number of writings (and some textbooks) I've seen simply end here by saying "the algorithm gains at most 1 bit for every query and we need at least log(n!)nlogn\log(n!) \sim n \log n bits" and then give some vague citation to Shannon's theorem about communication without explicitly writing out any argument. (While there is some connection, it's unclear what it is or how to even make it rigorous in a general sense.) It's a bit of a shame since it doesn't really take much math to fully justify this intuitive statement, as we will see next.

An information-theoretic approach

The idea behind a (simple) information theoretic approach is to view the algorithm as attempting to 'uncover' the true, sorted permutation by querying an oracle that knows whether two elements are in sorted order or not. Here, the oracle gets to choose some 'true' permutation uniformly at random and then the sorting algorithm queries the oracle in a restricted sense: it can only ask yes or no questions about the list.

More formally, we will let X\mathcal X be the set of all n!n! possible permutations of 11 through nn and let XX the oracle's permutation such that XXX \sim \mathcal X is uniformly randomly sampled from X\mathcal X. Then, the algorithm gets to ask a sequence of yes/no queries Yi{0,1}Y_i \in \{0, 1\} for i=1,,ki=1, \dots, k which are dependent on the permutation XX, and, at the end of kk queries, must respond with some 'guess' for the true permutation X^\hat X, which is a random variable that depends only on YY.

We can represent the current set up as a Markov chain XYX^X \to Y \to \hat X, since X^\hat X is conditionally independent of XX given YY (i.e., the algorithm can only use information given by YY) while the random variable YY depends only on XX. The idea then is to give a bound on the number of queries kk required for an algorithm to succeed with probability 1.

To give a lower bound on this quantity, we'll use a tool from information theory called Fano's inequality, which, surprisingly, I don't often see taught in information theory courses. (Perhaps I haven't been taking the right ones!)

Fano's inequality

I learned about this lovely inequality in John Duchi's class, EE377: Information Theory and Statistics (the lectures notes for the class are here). Its proof really makes it clear why entropy, as defined by Shannon, is pretty much the exact right quantity to look at. We'll explore a weaker version of it here that is simpler to prove and requires fewer definitions but which will suffice for our purposes.

We will use X,YX, Y as random variables and set HH as the entropy, defined:

H(X)=XXP(X)logP(X), H(X) = -\sum_{X \in \mathcal{X}}P(X) \log P(X),

where X\mathcal{X} is the space of values that XX can take on. The conditional entropy of XX given YY is defined as

H(XY)=YYP(Y)XXP(XY)logP(XY)=H(X,Y)H(Y). H(X|Y) = -\sum_{Y \in \mathcal Y} P(Y) \sum_{X \in \mathcal X} P(X\mid Y) \log P(X\mid Y) = H(X, Y) - H(Y).

As usual, the entropy HH is a measure of the 'uncertainty' in the variable XX, with the maximally uncertain distribution being the uniform one.[1] Additionally, note that H(X,Y)H(X, Y) is the entropy taken with respect to the joint distribution of XX and YY. Finally, if XX is a deterministic function of YY, then H(XY)=0H(X\mid Y) = 0 which follows from the definition.

For this post, we will only make use of the following three properties of the entropy (I will not prove them here, as there are many available proofs of them, including the notes above and Cover's famous Elements of Information Theory):

  1. The entropy can only decrease when removing variables, H(X,Y)H(X)H(X, Y) \ge H(X).

  2. The entropy is smaller than the log of the size of the sample space H(XY)H(X)logXH(X \mid Y) \le H(X) \le \log|\mathcal X|. (Equivalently, conditioning reduces entropy, and the uniform distribution on X\mathcal X has the highest possible entropy, logX\log|\mathcal X|.)

  3. If a random variable X^\hat X is conditionally independent of XX given YY (i.e., if XYX^X \to Y \to \hat X is a Markov chain), then H(XX^)H(XY)H(X\mid \hat X) \ge H(X\mid Y). This is often called a data processing inequality, which simply says that XX has smaller entropy knowing YY than knowing a variable, X^\hat X, that has undergone further processing. In other words, you cannot gain more information about XX from a variable YY by further processing YY.

This is all we need to prove the following inequality. Let XYX^X \to Y \to \hat X be a Markov chain such that X^\hat X is conditionally independent of XX given YY and XX is uniformly drawn from X\mathcal X, then the probability that XX^X \ne \hat X is given by PeP_e and PeP_e satisfies

Pe1k+1logX, P_e \ge 1 - \frac{k+1}{\log |\mathcal X|},

where kk is the number of binary queries made and X|\mathcal X| is the number of elements in X\mathcal X.

A quick aside

Proving this inequality is enough to show the claim. Note that, we want the probability of error to be 0 (since we want our algorithm to work!) so

1k+1logX0impliesk+1logX, 1 - \frac{k+1}{\log |\mathcal X|} \le 0 \quad \text{implies} \quad k+1 \ge \log |\mathcal X|,

and, since X|\mathcal X| is the space of possible permutations of size nn, (of which there are n!n! of) then X=n!|\mathcal X| = n! and logn!Ω(nlogn)\log n! \sim \Omega(n \log n) then kk must satisfy

k=Ω(nlogn). k = \Omega(n \log n).

In other words, the number of queries (or comparisons, or whatever) kk must be approximately at least as large as nlognn \log n, asymptotically (up to constant multiples). In fact, this proves a slightly stronger statement that no probabilistic algorithm can succeed with nonvanishing probability if the number of queries is not on the order of nlognn \log n, which our original proof above does not cover!

Now, the last missing piece is showing that the probability of error is bounded from below.

A proof of Fano's inequality

This is a slightly simplified version of the proof presented in Duchi's notes (see section 2.3.2) for the specific case we care about, which requires less additional definitions and supporting statements. Let EE be the 'error' random variable that is 11 if XX^X \ne \hat X and 0 otherwise, then let's look at the quantity H(E,XX^)H(E, X\mid \hat X):

H(E,XX^)=H(XX^,E)+H(EX^), H(E, X\mid \hat X) = H(X\mid \hat X, E) + H(E\mid \hat X),

by our definition of conditional entropy, and

H(XX^,E)=PeH(XX^,E=1)+(1Pe)H(XX^,E=0), H(X\mid \hat X, E) = P_e H(X\mid \hat X, E=1) + (1-P_e)H(X\mid \hat X, E=0),

again by the same definition. Since X=X^X = \hat X whenever there is no error, E=0E=0, then H(XX^,E=0)=0H(X\mid \hat X, E=0) = 0 since XX is known, so, we really have

H(E,XX^)=PeH(XX^,E=1)+H(EX^). H(E, X\mid \hat X) = P_e H(X\mid \hat X, E=1) + H(E\mid \hat X).

Since EE can only take on two values, we have that H(EX^)log(2)=1H(E\mid \hat X) \le \log(2) = 1 and we also have that H(XX^,E=1)logXH(X\mid \hat X, E=1) \le \log |\mathcal X|, which gives

H(E,XX^)PelogX+1. H(E, X\mid \hat X) \le P_e \log|\mathcal X| + 1.

Now, we have that

H(E,XX^)H(XX^)H(XY). H(E, X\mid \hat X) \ge H(X\mid \hat X) \ge H(X\mid Y).

The first inequality follows from the fact that we're removing a variable and the second follows from statement 3 in the previous section (as XYX^X \to Y \to \hat X). Using the definition of H(XY)H(X\mid Y), then we have

H(X,Y)H(Y)H(X)H(Y)=logXH(Y)logXlogY. H(X, Y) - H(Y) \ge H(X) - H(Y) = \log |\mathcal X| - H(Y) \ge \log |\mathcal X| - \log|\mathcal Y|.

The first inequality here follows since we're (again!) removing a variable and the equality follows from the fact that XX is uniformly randomly drawn from X\mathcal X and the last inequality follows from the fact that the entropy of YY is always smaller than that of the uniform distribution on Y\mathcal Y. Finally, note that, if we have kk queries, then Y=2k|\mathcal Y| = 2^k (this is the number of possible values a sequence of kk binary queries can take on). So, logY=k\log |\mathcal Y| = k (in other words, the maximum amount of information we can get with kk binary queries is kk bits) so we find

logXkPelogX+1, \log |\mathcal X| - k \le P_e\log |\mathcal X| + 1,

or, after some slight rewriting:

Pe1k+1logX, P_e \ge 1 - \frac{k+1}{\log |\mathcal X|},

as required!

Some discussion

I think overall that Fano's inequality is a relatively straightforward way of justifying a ton of the statements one can make about information theory without needing to invoke a large number of more complicated notions. Additionally, the proof is relatively straightforward (in the sense that it only requires very few definitions and properties of the entropy) while also matching our intuition about these problems pretty much exactly.

In particular, we see that sorting is hard not because somehow ordering elements is difficult, but because we have to decide between a bunch of different items (in fact, n!n! of them) while only receiving a few bits of information at any point in time! In fact, this bound applies to any yes/no query that one can make of the sorted data, not just comparisons, which is interesting.

There are some even more powerful generalizations of Fano's inequality which can also be extended to machine learning applications: you can use them to show that, given only some small amount of data, you cannot decide between a parameter that correctly describe the data and one that does not.

This is all to say that, even though entropy is a magical quantity, that doesn't mean we can't say very rigorous things about it (and make our intuitions about lower bounds even more rigorous, to boot!).

[1] In fact, the entropy is really a measure of how close a variable is to the uniform distribution, in the case of compact domains X\mathcal X—the higher the entropy, the closer it is.
Built with Franklin.jl and Julia.