Longest Path Search

Markov processes and the second law

Posted 2018-09-13

Note: This is another one of those "quick" posts about a topic I've found to be fascinating, but which is almost never discussed.

Physics has this nice little law called the second law of thermodynamics, which governs essentially every physical thermodynamic system. The second law is usually phrased as the nice quote "everything tends to disorder," or some variation thereof which sounds intuitive but is as opaque as a concrete wall when applied in practice.

As usual, I won't really touch this or other similar discussions with a 10-foot pole (though here is a good place to start), but I'll give some thoughts on a similar mathematical principle which arises in statistical systems.

Markov chains

Since this is a short post, I won't describe Markov chains in detail, but, as a refresher, a Markov chain (or Markov process) is a process in which the next state depends only on the current state. In other words, it's a little like the game Snakes and Ladders, where your next position depends only on your current square and your next dice throw (independent of whether it's the 1st or 5th time you've landed on that square).

In particular, we know the probability of being at a new square, $x_{t+1}$, at time $t+1$, given that we were in square $x_t$ at time $t$. In other words, we know $p(X_{t+1}=x~|~X_{t}=x')$. Similarly, if we weren't sure of our position at time $t$, but rather we have a probability distribution over positions, say $p(X_t = x)$ (which I will call $p_t(x)$, for convenience), then the probability of being at position $x$ at time $t+1$, given our belief about possible positions $x'$ at time $t$, is

$$ p_{t+1}(x) = \sum_{x'} p(X_{t+1}=x~|~X_{t}=x')\,p_t(x').\tag{1} $$

In other words, we just multiply these two probabilities and sum over all possible states $x'$ at time $t$. The defining trait of (stationary) Markov processes is that $p(X_{t+1}=x~|~X_{t}=x')$, which I will call $K(x, x')$ from now on, is the same for all $t$, and equation (1), now written as

$$ p_{t+1}(x) = \sum_{x'} K(x, x')\,p_t(x'), $$

holds for any $t$.
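If you'd like to play with this, here's a minimal numerical sketch in Python/NumPy (the 3-state kernel below is completely made up, purely for illustration): equation (1) is nothing more than a matrix-vector product.

```python
import numpy as np

# A made-up 3-state kernel: entry K[x, xp] plays the role of K(x, x'),
# i.e., the probability of moving to state x given the current state x'.
# Each column is a probability distribution, so the columns sum to one.
K = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.3],
    [0.0, 0.1, 0.7],
])
assert np.allclose(K.sum(axis=0), 1.0)

# Equation (1), p_{t+1}(x) = sum_{x'} K(x, x') p_t(x'), is just K @ p.
p = np.array([1.0, 0.0, 0.0])  # start from a point mass on state 0
for t in range(5):
    p = K @ p
    print(t + 1, p)
```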

It's probably not hard to believe that Markov chains are used absolutely everywhere, since they make for mathematically simple, but surprisingly powerful models of, well, everything.

This is all I will mention about Markov chains, in general. If you'd like a bit of a deeper dive, this blog post is a beautiful (and interactive!) reference. I highly recommend it.

A variation on a theme

Let $p_t$ be the distribution of a process at time $t$; then the second law says that (and I will use statistics notation, rather than physics notation, from here on out!) the entropy

$$ H(p_t) \equiv -\sum_x p_t(x) \log p_t(x), $$

is non-decreasing as time, $t$, increases. Note that I'll be dealing with discrete time and space here, but all of these statements, with some modifications, hold for continuous processes. Anyways, more explicitly, we can write

$$ H(p_{t+1}) \ge H(p_t), $$

but it turns out this law, as stated, doesn't quite hold for many Markov processes. It does, on the other hand, hold for the set of processes whose transition probabilities are symmetric (more generally, it holds iff the transitions are doubly stochastic; Cover has a slick, few-line proof of this which relies on some properties of the KL-divergence).

In this case, the probability of going from state $A$ to state $B$ is the same as the probability of going from state $B$ to state $A$. Writing this out mathematically, it says:

$$ K(x, x') = K(x', x). $$

I should note that this is a very strong condition, but it makes for a simple proof of the above law. To prove it, first note that the KL-divergence is nonnegative: since the negative log is convex, Jensen's inequality gives (this is the same argument as in the previous post)

$$
\begin{aligned}
D(p_t\lVert p_{t'}) &= -\sum_x p_t(x) \log \frac{p_{t'}(x)}{p_t(x)} \\
&\ge -\log\left(\sum_x p_t(x)\frac{p_{t'}(x)}{p_t(x)}\right) \\
&= -\log\left(\sum_x p_{t'}(x)\right) \\
&= -\log 1 \\
&= 0,
\end{aligned}
$$

since $p_{t'}$ and $p_t$ are probability distributions (i.e., nonnegative and sum to one).
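As a quick sanity check of this nonnegativity, here's a small sketch (the `kl` helper and the random distributions below are just illustrative):

```python
import numpy as np

def kl(p, q):
    """D(p || q) = sum_x p(x) log(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

rng = np.random.default_rng(0)
for _ in range(5):
    p = rng.random(4); p /= p.sum()   # two arbitrary distributions on 4 states
    q = rng.random(4); q /= q.sum()
    assert kl(p, q) >= 0              # Jensen's inequality in action
```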

Here is the magical trick to proving the above. Note that

$$ D(p_t \lVert p_{t+2}) \ge 0, $$

so

$$ -\sum_x p_t(x) \log \frac{p_{t+2}(x)}{p_t(x)} \ge 0, $$

which means

$$ -\sum_x p_t(x) \log p_{t+2}(x) \ge -\sum_x p_t(x) \log p_t(x). $$

But, by definition, we have that

$$ p_{t+2}(x) = \sum_{x'}K(x, x')\,p_{t+1}(x'), $$

so

$$ -\sum_x p_t(x) \log \left(\sum_{x'} K(x, x')p_{t+1}(x')\right) \ge -\sum_x p_t(x) \log p_t(x), $$

but, by Jensen's inequality (again!) applied to the left-hand side, we get

$$ -\sum_x p_t(x) \log \left(\sum_{x'} K(x, x')p_{t+1}(x')\right) \le -\sum_{x, x'} p_t(x)K(x, x') \log p_{t+1}(x'). $$

Since we know $K(x, x') = K(x', x)$, we immediately have that

$$ \sum_x p_t(x)K(x, x') = p_{t+1}(x'), $$

so, putting it all together

$$
\begin{aligned}
H(p_{t+1}) &= -\sum_{x'} p_{t+1}(x') \log p_{t+1}(x') \\
&= -\sum_{x, x'} p_t(x)K(x, x') \log p_{t+1}(x')\\
&\ge -\sum_x p_t(x) \log p_t(x) \\
&= H(p_t),
\end{aligned}
$$

which is what we wished to prove.
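To watch this happen numerically, here's a small sketch using a hypothetical symmetric kernel, a lazy random walk on a 6-cycle (the `entropy` helper and this particular kernel are just illustrative choices):

```python
import numpy as np

def entropy(p):
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]))

# Lazy random walk on a 6-cycle: stay put with probability 1/2, otherwise
# step to one of the two neighbors uniformly. The resulting kernel is
# symmetric, K(x, x') = K(x', x), and hence doubly stochastic.
n = 6
S = np.roll(np.eye(n), 1, axis=0)       # one-step shift around the cycle
K = 0.5 * np.eye(n) + 0.25 * (S + S.T)
assert np.allclose(K, K.T)

p = np.zeros(n)
p[0] = 1.0                              # start from a point mass
for t in range(20):
    p_next = K @ p
    assert entropy(p_next) >= entropy(p) - 1e-12   # H(p_{t+1}) >= H(p_t)
    p = p_next
print(entropy(p), np.log(n))
```

Starting from a point mass, the entropy climbs monotonically toward its maximum, $\log 6$, the entropy of the uniform distribution.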

More general Markov processes

So, while it turns out this law doesn't hold for general Markov processes, a very similar law does hold. If a Markov process has a stationary distribution, $p$, then:

$$ D(p_{t+1}\lVert p) \le D(p_t \lVert p), $$

so, as the Markov chain continues evolving, the KL divergence between the current distribution and the equilibrium distribution never increases.
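As a quick sketch of this statement (the 3-state kernel below is made up, and I approximate the stationary distribution by simply running the chain for a while):

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# A made-up kernel that is *not* symmetric: columns sum to one, rows do not.
K = np.array([
    [0.5, 0.2, 0.1],
    [0.3, 0.7, 0.1],
    [0.2, 0.1, 0.8],
])

# Approximate the stationary distribution by iterating from the uniform
# distribution; all entries of K are positive, so this converges.
pi = np.ones(3) / 3
for _ in range(1000):
    pi = K @ pi

p = np.array([1.0, 0.0, 0.0])   # some initial distribution
for t in range(10):
    p_next = K @ p
    assert kl(p_next, pi) <= kl(p, pi) + 1e-12   # D(p_{t+1} || pi) <= D(p_t || pi)
    p = p_next
```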

In fact, more generally, for any two initial probability distributions $p_0$ and $q_0$, we have that

$$ D(p_{t+1} \lVert q_{t+1}) \le D(p_{t} \lVert q_{t}), $$

so the KL-divergence between any two distributions undergoing the same (Markovian) dynamics never increases! Even if the Markov process does not have a unique stationary distribution, there is still a type of second law which holds, in a very general sense.

As before, Cover has a fantastic, slick proof of the above, which I highly recommend you read!
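And here's a matching sketch of the more general statement, using the same kind of made-up, non-symmetric kernel: two arbitrary initial distributions pushed through the same kernel never get further apart in KL-divergence.

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# The same sort of non-symmetric, column-stochastic kernel as above.
K = np.array([
    [0.5, 0.2, 0.1],
    [0.3, 0.7, 0.1],
    [0.2, 0.1, 0.8],
])

rng = np.random.default_rng(1)
p = rng.random(3); p /= p.sum()   # two arbitrary initial distributions
q = rng.random(3); q /= q.sum()

for t in range(10):
    p_next, q_next = K @ p, K @ q
    assert kl(p_next, q_next) <= kl(p, q) + 1e-12   # D never increases
    p, q = p_next, q_next
```

Taking $q_0$ to be a stationary distribution recovers the previous inequality as a special case.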
