Posted 2018-09-13
Note: This is another one of those "quick" posts about a topic I've found to be fascinating, but which is almost never discussed.
Physics has this nice little law called the second law of thermodynamics, which governs every physical thermodynamic system. The second law is usually phrased as the nice quote "everything tends to disorder," or some variation thereof, which sounds intuitive but is as opaque as a concrete wall when applied in practice.
As usual, I won't really touch this or other similar discussions with a 10-foot pole (though here is a good place to start), but I'll be giving some thoughts on a similar mathematical principle which arises from statistical systems.
Since this is a short post, I won't be describing Markov chains in detail, but as a refresher, a Markov chain (or Markov process) is a process in which the next state depends only on the current state. In other words, it's a little like the game Snakes and Ladders, where your next position depends only on your current square and your next dice throw (regardless of whether it's the 1st or 5th time you've landed on that square).
In particular, we know the probability of being at a new square, $j$, at time $t+1$, given that we were in square $i$ at time $t$. In other words, we know $P(X_{t+1} = j \mid X_t = i)$. Similarly, if we weren't sure of our position at time $t$, but rather we have a probability distribution over positions, say $P(X_t = i)$ (which I will call $p_t(i)$, for convenience), then the probability of being at position $j$ at time $t+1$, given our belief about possible positions at time $t$, is

$$
p_{t+1}(j) = \sum_i P(X_{t+1} = j \mid X_t = i)\, p_t(i). \tag{1}
$$

In other words, we just multiply these two probabilities and sum over all possible states at time $t$. The defining trait of (stationary) Markov processes is that $P(X_{t+1} = j \mid X_t = i)$, which I will call $P(j \mid i)$ from now on, is the same for all $t$, and equation (1), now written as

$$
p_{t+1}(j) = \sum_i P(j \mid i)\, p_t(i), \tag{2}
$$

holds for any $t$.
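If it helps to see equation (2) as code, here's a minimal numerical sketch (the 3-state matrix `P` and the starting belief `p_t` below are made up purely for illustration). With the convention that row $i$ of `P` holds the probabilities $P(j \mid i)$, one step of the chain is just a matrix-vector product.

```python
import numpy as np

# A small, made-up 3-state chain. Row i holds the probabilities P(j | i)
# of jumping from state i to each state j, so each row sums to one.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
])

# Current belief over states, p_t.
p_t = np.array([1.0, 0.0, 0.0])

# Equation (2): p_{t+1}(j) = sum_i P(j | i) p_t(i).
p_next = P.T @ p_t

print(p_next)        # [0.5 0.3 0.2]
print(p_next.sum())  # 1.0 -- still a probability distribution
```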
It's probably not hard to believe that Markov chains are used absolutely everywhere, since they make for mathematically simple, but surprisingly powerful models of, well, everything.
This is all I will mention about Markov chains, in general. If you'd like a bit of a deeper dive, this blog post is a beautiful (and interactive!) reference. I highly recommend it.
Let $p_t$ be the distribution of a process at time $t$; then the second law says that (and I will use statistics notation, rather than physics notation, from here on out!) the entropy

$$
H(p_t) = -\sum_i p_t(i) \log p_t(i)
$$

is non-decreasing as time, $t$, increases. Note that I'll be dealing with discrete time and space here, but all of these statements, with some modifications, hold for continuous processes. Anyways, more generally, we can write

$$
H(p_{t+1}) \ge H(p_t),
$$
but it turns out this law, as stated, doesn't quite hold for many Markov processes. It does, on the other hand, hold for the set of processes whose transition probabilities are symmetric (more generally, it holds iff the transitions are doubly stochastic; Cover has a slick, few-line proof of this which relies on some properties of the KL-divergence).
In this case, the probability of going from state $i$ to state $j$ is the same as the probability of going from state $j$ to state $i$. Writing this out mathematically, it says:

$$
P(j \mid i) = P(i \mid j).
$$
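As a quick numerical sanity check of why symmetry is such a strong condition, here's a small sketch (the matrix below is made up): the rows of any transition matrix sum to one, so a symmetric matrix automatically has columns summing to one as well, i.e., it is doubly stochastic.

```python
import numpy as np

# A made-up symmetric transition matrix: P[i, j] = P(j | i) = P(i | j).
P = np.array([
    [0.6, 0.3, 0.1],
    [0.3, 0.5, 0.2],
    [0.1, 0.2, 0.7],
])

assert np.allclose(P, P.T)              # symmetry
assert np.allclose(P.sum(axis=1), 1.0)  # rows sum to one (true of any transition matrix)
assert np.allclose(P.sum(axis=0), 1.0)  # columns sum to one too, i.e., doubly stochastic
```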
I should note that this is a very strong condition, but it can be quite useful in giving a simple proof of the above law. To prove this, first note that the KL-divergence is nonnegative, since the negative log is convex, thus by Jensen's inequality (this is the same proof as the previous post):

$$
D(p \,\|\, q) = \sum_i p(i) \log\frac{p(i)}{q(i)} = \sum_i p(i)\left(-\log\frac{q(i)}{p(i)}\right) \ge -\log\left(\sum_i p(i)\,\frac{q(i)}{p(i)}\right) = -\log\left(\sum_i q(i)\right) = 0,
$$

since $p$ and $q$ are probability distributions (i.e., nonnegative and sum to one).
Here is the magical trick to proving the above. Note that

$$
\sum_j P(j \mid i) = 1 \quad \text{for every } i,
$$

so

$$
H(p_t) = -\sum_i p_t(i) \log p_t(i) = -\sum_{i,j} P(j \mid i)\, p_t(i) \log p_t(i),
$$

which means

$$
H(p_{t+1}) - H(p_t) = -\sum_j p_{t+1}(j) \log p_{t+1}(j) + \sum_{i,j} P(j \mid i)\, p_t(i) \log p_t(i).
$$

But, by definition, we have that

$$
p_{t+1}(j) = \sum_i P(j \mid i)\, p_t(i),
$$

so

$$
H(p_{t+1}) - H(p_t) = \sum_{i,j} P(j \mid i)\, p_t(i) \log\frac{p_t(i)}{p_{t+1}(j)} = -\sum_{i,j} P(j \mid i)\, p_t(i) \log\frac{p_{t+1}(j)}{p_t(i)},
$$

but, by Jensen's inequality (again!) applied to the right-hand side, we get

$$
-\sum_{i,j} P(j \mid i)\, p_t(i) \log\frac{p_{t+1}(j)}{p_t(i)} \ge -\log\left(\sum_{i,j} P(j \mid i)\, p_t(i)\, \frac{p_{t+1}(j)}{p_t(i)}\right) = -\log\left(\sum_{i,j} P(j \mid i)\, p_{t+1}(j)\right).
$$

Since we know, by symmetry, that $\sum_i P(j \mid i) = \sum_i P(i \mid j) = 1$, then we immediately have that

$$
\sum_{i,j} P(j \mid i)\, p_{t+1}(j) = \sum_j p_{t+1}(j) = 1,
$$

so, putting it all together,

$$
H(p_{t+1}) - H(p_t) \ge -\log(1) = 0,
$$

which is what we wished to prove.
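If you'd rather watch the law happen than take the algebra on faith, here's a small numerical sketch (reusing the made-up symmetric matrix from the check above): the entropy of the evolving distribution never decreases.

```python
import numpy as np

def entropy(p):
    # H(p) = -sum_i p(i) log p(i), using the convention 0 log 0 = 0.
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# The same made-up symmetric (hence doubly stochastic) matrix as above.
P = np.array([
    [0.6, 0.3, 0.1],
    [0.3, 0.5, 0.2],
    [0.1, 0.2, 0.7],
])

p = np.array([1.0, 0.0, 0.0])  # start fully concentrated on one state
for _ in range(10):
    p_next = P.T @ p
    assert entropy(p_next) >= entropy(p) - 1e-12  # the entropy never decreases
    p = p_next

print(entropy(p))  # for this particular chain, creeps up toward log(3) ≈ 1.0986
```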
So, while it turns out this law doesn't hold for general Markov processes, a very similar law does hold. If a Markov process has a stationary distribution, $\pi$ (that is, a distribution satisfying $\pi(j) = \sum_i P(j \mid i)\, \pi(i)$), then:

$$
D(p_{t+1} \,\|\, \pi) \le D(p_t \,\|\, \pi),
$$

so, as the Markov chain continues evolving, the KL divergence between the current distribution and the equilibrium distribution never increases.
In fact, more generally, for any two initial probability distributions $p_0$ and $q_0$, we have that

$$
D(p_{t+1} \,\|\, q_{t+1}) \le D(p_t \,\|\, q_t),
$$

so the KL-divergence between any two distributions undergoing the same (Markovian) dynamics never increases! Even if the Markov process does not have a unique stationary distribution, there is still a type of second law which holds, in a very general sense.
As before, Cover has a fantastic, slick proof of the above, which I highly recommend you read!
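To close, here's one more numerical sketch of this general statement (the chain and the two starting distributions below are made up, and the chain is deliberately not doubly stochastic): two different initial distributions pushed through the same dynamics never move further apart in KL-divergence.

```python
import numpy as np

def kl(p, q):
    # D(p || q) = sum_i p(i) log(p(i) / q(i)); assumes q(i) > 0 wherever p(i) > 0.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# A made-up chain that is *not* doubly stochastic (its columns don't sum to one).
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
])

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.1, 0.1, 0.8])
for _ in range(20):
    p_next, q_next = P.T @ p, P.T @ q
    assert kl(p_next, q_next) <= kl(p, q) + 1e-12  # the divergence never increases
    p, q = p_next, q_next
```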