Longest Path Search

Hamilton–Jacobi–Bellman is just linear duality


This is, I suspect, a very well known result, but it’s worth writing down anyway. (It is certainly a “folk” theorem in the sense that, e.g., Stephen Boyd has mentioned something similar in spirit in at least one lecture I’ve attended.) This blogpost was inspired by Matt Lorig’s latest note on the Hamilton–Jacobi–Bellman equation, which is a fun little read if you’re interested in the stochastic case!

Ok, let’s get to it.

Set up

In our scenario, we have some system evolving through time $t = 1, \dots, T$. At each time $t$, the system is in one of $m$ states, and, after observing this state, we can then take one of $n$ possible actions.1 In other words, there are $mn$ possible state-action pairs. We will say a state-action pair is executed if the system is in the state represented by the state-action pair and we perform the action represented by the same pair.

It will be very convenient to define the “sum-over-actions” matrix $S \in \mathbf{R}^{m \times mn}$ as, for $i = 1, \dots, m$ and $j = 1, \dots, mn$,

$$
S_{ij} = \begin{cases} 1 & \text{state } i \text{ is part of state-action pair } j \\ 0 & \text{otherwise.} \end{cases}
$$

So, if, for example, we have a vector $y \in \mathbf{R}^{mn}$ over the $mn$ state-action pairs, then $Sy$ is an $m$-vector whose $i$th entry denotes the sum over all actions for that particular state $i$. In other words, if we think of $y$ as a probability distribution over state-action pairs, the matrix $S$ marginalizes out the actions, leaving the (marginal) distribution over states.
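To make this concrete, here is a tiny numpy sketch. The indexing of pairs as $j = i\,n + a$ (state $i$, action $a$, zero-indexed) is my own choice for concreteness; any fixed ordering of the pairs works the same way.

```python
import numpy as np

m, n = 3, 2          # states, actions (small, arbitrary example)

# Sum-over-actions matrix S: S[i, j] = 1 if state-action pair j involves state i.
# Here we assume pair j = i * n + a corresponds to (state i, action a).
S = np.zeros((m, m * n))
for i in range(m):
    for a in range(n):
        S[i, i * n + a] = 1.0

# A distribution y over state-action pairs...
y = np.random.rand(m * n)
y /= y.sum()

# ...marginalized over the actions gives a distribution over states.
print(S @ y)             # m-vector of state probabilities
print((S @ y).sum())     # 1.0 (up to floating point)
```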

For each time $t = 1, \dots, T-1$, we have an associated transition matrix $M_t \in \mathbf{R}^{m \times mn}$. This matrix will map a state-action pair (one of the $mn$ possible ones) to a probability distribution over the $m$ states of the system. In particular, if the system is in some state and takes some action, such that the state-action pair is represented by the index $j$ with $1 \le j \le mn$, then the $j$th column of $M_t$ denotes the resulting probability distribution over states of the system after the state-action pair $j$ has been executed. (As an example, if the system has deterministic dynamics, then each column of $M_t$ is a single unit basis vector. The one nonzero entry of this column denotes the state that the system lands in after executing the state-action pair $j$.)
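As a quick sketch of the deterministic case just mentioned (using the same $j = i\,n + a$ indexing as above; the transition table itself is made up purely for illustration):

```python
import numpy as np

m, n = 3, 2
# next_state[i, a]: the state reached after taking action a in state i.
next_state = np.array([[1, 2],
                       [0, 2],
                       [0, 1]])

# Deterministic dynamics: column j = i * n + a of M_t is the unit basis
# vector of the landing state. (Stochastic dynamics would put a full
# probability distribution in each column instead.)
M_t = np.zeros((m, m * n))
for i in range(m):
    for a in range(n):
        M_t[next_state[i, a], i * n + a] = 1.0

print(M_t.sum(axis=0))   # each column sums to 1, i.e., is a distribution
```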

Finally, at each time step $t = 1, \dots, T$, we will assign a loss $\ell_t \in \mathbf{R}^{mn}$ at time $t$. The $j$th entry of this vector, $(\ell_t)_j$, denotes the loss we eat from executing the state-action pair $j = 1, \dots, mn$.

The natural question is: ok, we have this system that we can control, how do we choose our actions in order to minimize the total loss across all time steps? This is, of course, a control problem and we specify it next.

Control problem

From before, we would like to control the system such that the total loss is minimized, so let’s write that all as a simple optimization problem.

Assuming the system’s initial state is picked from some distribution $p_1 \in \mathbf{R}^m$, this is just a linear programming problem:

$$
\begin{array}{ll}
\text{minimize} & \sum_{t=1}^T \ell_t^T x_t \\
\text{subject to} & S x_{t+1} = M_t x_t, \quad t = 1, \dots, T-1 \\
& S x_1 = p_1 \\
& x_t \ge 0, \quad t = 1, \dots, T.
\end{array}
$$

Here, the variables are the distributions $x_t \in \mathbf{R}^{mn}$ over the state-action pairs for $t = 1, \dots, T$. This problem is essentially writing down everything we just mentioned in the previous paragraphs, but let’s go through it step by step.

First, the objective. At each time step $t = 1, \dots, T$, the probability that we execute state-action pair $j$ is given by $(x_t)_j$. The loss incurred from doing that is $(\ell_t)_j$, so take the inner product and sum over all times $t = 1, \dots, T$ to get the expected total loss.

Let’s start with the last two constraints. Obviously, since $x_t$ is a probability distribution, it should be nonnegative, which is just being enforced here for all times. On the other hand, we know that the initial state of the system (no matter what actions we take) will have probability distribution $p_1$, by assumption. So, we must have that $S x_1 = p_1$, as required.

Finally, the first constraint is simply encoding the dynamics we talked about in the previous section. The matrix $M_t$ maps the state-action pairs at time $t$ to the distribution over states at time $t+1$. But the distribution over states at time $t+1$ is, by definition, the sum over all possible actions, for each state, which is simply $S x_{t+1}$. We enforce this for each time $t = 1$ all the way to $t = T-1$ (since the final distribution is $x_T$).

We can solve this problem directly, of course, using just a standard linear programming solver and be done with life. But, like many things, the dual problem often has a lovely interpretation. (And, in some important cases, as in this one, it will be much easier to solve!)

As a side comment, note that we do not need the constraint that $\mathbf{1}^T x_t = 1$ (i.e., that $x_t$ is normalized) since this is implied by the fact that $S x_1 = p_1$: each column of $S$ sums to one, so $\mathbf{1}^T x_1 = \mathbf{1}^T S x_1 = \mathbf{1}^T p_1 = 1$. (The rest of the time steps $t = 2, \dots, T$ are then normalized by the definition of $M_t$, whose columns are themselves distributions.)
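Since the problem really is just a linear program, here is a minimal sketch of setting it up and solving it numerically, say with cvxpy. All of the data below is randomly generated purely for illustration; in practice you would plug in your own $S$, $M_t$, $\ell_t$, and $p_1$.

```python
import numpy as np
import cvxpy as cp

m, n, T = 3, 2, 5                      # states, actions, horizon (arbitrary)
rng = np.random.default_rng(0)

# Sum-over-actions matrix, assuming pair j = i * n + a as before.
S = np.kron(np.eye(m), np.ones((1, n)))

# Random column-stochastic transition matrices and random losses.
M = [rng.random((m, m * n)) for _ in range(T - 1)]
M = [Mt / Mt.sum(axis=0, keepdims=True) for Mt in M]
ell = [rng.random(m * n) for _ in range(T)]
p1 = rng.random(m); p1 /= p1.sum()

# The linear program from above.
x = [cp.Variable(m * n, nonneg=True) for _ in range(T)]
constraints = [S @ x[0] == p1]
constraints += [S @ x[t + 1] == M[t] @ x[t] for t in range(T - 1)]
objective = cp.Minimize(sum(ell[t] @ x[t] for t in range(T)))

problem = cp.Problem(objective, constraints)
problem.solve()
print(problem.value)                   # optimal expected total loss
```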

Dual problem

To derive the dual problem, let’s introduce the following dual variables: $\nu_1$ will correspond to the second-to-last constraint, $S x_1 = p_1$, while $\nu_{t+1}$ will correspond to the $t$th dynamics constraint $S x_{t+1} = M_t x_t$. (The constraint that $x_t \ge 0$ will simply relax some equalities to inequalities, so I won’t include it here, but you can introduce additional multipliers for it, if that helps the bookkeeping.)

The resulting Lagrangian is, for $x_t \ge 0$,

$$
L(x, \nu) = \sum_{t=1}^T \ell_t^T x_t + \sum_{t=1}^{T-1} \nu_{t+1}^T (M_t x_t - S x_{t+1}) + \nu_1^T (p_1 - S x_1).
$$

Rearranging gives

$$
L(x, \nu) = \sum_{t=1}^{T-1} (\ell_t + M_t^T \nu_{t+1} - S^T \nu_t)^T x_t + (\ell_T - S^T \nu_T)^T x_T + p_1^T \nu_1.
$$
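To get from here to the dual, the only fact we need is the standard one: for any vector $c$,

$$
\inf_{x \ge 0} c^T x = \begin{cases} 0 & c \ge 0 \\ -\infty & \text{otherwise}, \end{cases}
$$

so the minimum of the Lagrangian over $x \ge 0$ is finite (and equal to $p_1^T \nu_1$) exactly when every coefficient vector multiplying an $x_t$ above is nonnegative.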

Finally, minimizing over $x_t \ge 0$ gives the dual optimization problem

$$
\begin{array}{ll}
\text{maximize} & p_1^T \nu_1 \\
\text{subject to} & S^T \nu_t \le \ell_t + M_t^T \nu_{t+1}, \quad t = 1, \dots, T-1 \\
& S^T \nu_T \le \ell_T.
\end{array}
$$

If you’re familiar with the HJB equation, well, the first constraint is pretty much it. If not, don’t worry, we’ll go through it in a second.
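If you want to see the duality numerically, here is a sketch of the dual in cvxpy, using the same randomly generated data as in the primal sketch above (regenerated here so the block stands alone). The two optimal values should agree, since strong duality holds for feasible, bounded LPs.

```python
import numpy as np
import cvxpy as cp

# Same illustrative data as in the primal sketch above.
m, n, T = 3, 2, 5
rng = np.random.default_rng(0)
S = np.kron(np.eye(m), np.ones((1, n)))
M = [rng.random((m, m * n)) for _ in range(T - 1)]
M = [Mt / Mt.sum(axis=0, keepdims=True) for Mt in M]
ell = [rng.random(m * n) for _ in range(T)]
p1 = rng.random(m); p1 /= p1.sum()

# The dual problem: maximize p1' nu_1 subject to the constraints above.
nu = [cp.Variable(m) for _ in range(T)]
constraints = [S.T @ nu[t] <= ell[t] + M[t].T @ nu[t + 1] for t in range(T - 1)]
constraints += [S.T @ nu[T - 1] <= ell[T - 1]]
dual = cp.Problem(cp.Maximize(p1 @ nu[0]), constraints)
dual.solve()
print(dual.value)   # should match the primal optimal value
```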

Analyzing the dual

As a matter of tradition, the vector $\nu_t$ is called the value function or cost-to-go at time $t$. (We will refer to it as the cost-to-go here.) This is a great observation of Bellman’s, who discovered that the original problem we had satisfies a dynamic programming principle, which, in turn, enables this value function to be easily constructed and later used to solve the original problem. (Though, of course, we discovered it here purely mechanically by just taking the dual problem.)

As a side note, the matrix $S^T$ has a nice interpretation we will use throughout. For each state $i$, the value $(S^T)_{ji}$ is one if state-action pair $j$ contains state $i$ and zero otherwise. In other words, the $i$th column of $S^T$ (equivalently, the $i$th row of $S$) is the Boolean vector indicating which state-action pairs correspond to state $i$.

Ok, now onto the problem. First, note that, since $p_1 \ge 0$ is nonnegative, the objective is nondecreasing in all of the variables. Additionally, since $M_t \ge 0$ is also nonnegative, the right hand side of each constraint is nondecreasing in $\nu_{t+1}$, and the left hand side is nondecreasing in $\nu_t$, since $S$ is also nonnegative. This means that, so long as we continue to satisfy the constraints of the problem, we can increase $\nu_t$ as much as we want and it will not decrease the objective value, for any $t$. (In general, it might even increase it, which would be great.)

Let’s start from $\nu_T$, the value function at the last time step.

What’s the maximum we can increase $\nu_T$? Well, we can increase $(\nu_T)_i$ as much as possible, until $(\nu_T)_i$ is equal to the lowest loss over the state-action pairs $j$ containing state $i$. Equivalently: over all possible actions we can take in state $i$, we can make $(\nu_T)_i$ equal to the smallest resulting loss. Writing it out:

$$
(\nu_T)_i = \min_{j \,\text{contains}\, i} \; (\ell_T)_j.
$$

This is true for all states $i = 1, \dots, m$, and this is the largest we can possibly hope to make $\nu_T$.

Now, consider $\nu_t$ for $t = 1, \dots, T-1$. Assume we have $\nu_{t+1}$ (since we have $\nu_T$, this is fine); then, using the same reasoning as before, we can take

$$
(\nu_t)_i = \min_{j \,\text{contains}\, i} \; (\ell_t + M_t^T \nu_{t+1})_j.
$$

The first term of the sum is the same as in the terminal case, except at time $t$. On the other hand, $M_t^T \nu_{t+1}$ can be interpreted as follows. For each state-action pair $j$, the value $(M_t^T \nu_{t+1})_j$ is the expected cost-to-go if we were to execute $j$. In particular, if we execute $j$, then the $j$th row of $M_t^T$ (equivalently, the $j$th column of $M_t$) tells us the probability of landing in each of the possible $m$ states of the system, and $(M_t^T \nu_{t+1})_j$ is exactly the expectation of $\nu_{t+1}$, the cost-to-go at the next time period, with respect to that distribution.

Finally, since the objective is nondecreasing and we have made $\nu_t$ as large as possible for all $t$, it must be the case that $p_1^T \nu_1$ is maximized, so this choice of $\nu_t$ is optimal and we are finished.

Note that this equation is exactly the stochastic Bellman equation (which I’m cheating and calling the HJB equation) as required!
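In code, this backward recursion is only a few lines. Here is a minimal sketch, again assuming the $j = i\,n + a$ indexing of state-action pairs from the earlier snippets (so the minimum over pairs containing a state is just a reshape followed by a minimum over the action axis); the helper name `cost_to_go` is my own, not anything standard.

```python
import numpy as np

def cost_to_go(ell, M, m, n):
    """Backward (Bellman) recursion for the dual variables nu_1, ..., nu_T.

    ell: list of T loss vectors, each of length m * n (pair j = i * n + a).
    M:   list of T - 1 column-stochastic transition matrices, m by m * n.
    """
    T = len(ell)
    nu = [None] * T
    # Terminal condition: best loss over the actions available in each state.
    nu[T - 1] = ell[T - 1].reshape(m, n).min(axis=1)
    # Recursion: loss now plus expected cost-to-go, minimized over actions.
    for t in range(T - 2, -1, -1):
        q = ell[t] + M[t].T @ nu[t + 1]      # one value per state-action pair
        nu[t] = q.reshape(m, n).min(axis=1)  # min over pairs containing each state
    return nu
```

Running this on the data from the LP sketches above, `p1 @ cost_to_go(ell, M, m, n)[0]` should reproduce the optimal value reported by the solver.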

Recovering the solution

The final question is: ok, we have the value functions $\nu_t$. How do we use these to solve for the original $x_t$?

Here’s a simple one, which I will leave to the reader to prove is an optimal point. We construct the $x_t$ forwards in time, so let’s assume we have $x_{t-1}$ for now (at $t = 1$ we simply use the given $p_1$). Let $j$ be any state-action pair containing state $i$ for which

$$
(\nu_t)_i = (\ell_t + M_t^T \nu_{t+1})_j,
$$

where the right hand side is interpreted as just $(\ell_T)_j$ when $t = T$. (That is, the state-action pair $j$ is a minimizer for state $i$ at time $t$.) Set $(x_t)_j = (M_{t-1} x_{t-1})_i$ (or $(p_1)_i$ when $t = 1$) and set all other state-action pairs which contain state $i$ to zero. Do the same for all possible states $i$ and for every time $t = 1, \dots, T$; this results in a feasible set of $x_t$. Can you show this is optimal? (The simplest way is essentially the Bellman approach. The second simplest is via complementary slackness.)2
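Here is a sketch of that forward pass, under the same indexing assumption as before; `nu` is the list of cost-to-go vectors from the backward recursion (e.g., the output of the `cost_to_go` sketch above).

```python
import numpy as np

def recover_solution(nu, ell, M, p1, m, n):
    """Forward pass: recover an optimal x_1, ..., x_T from the cost-to-go nu.

    nu:  list of T state value vectors (e.g., from the backward recursion above).
    ell: list of T loss vectors over state-action pairs (pair j = i * n + a).
    M:   list of T - 1 transition matrices, m by m * n.
    p1:  initial distribution over states.
    """
    T = len(ell)
    x = []
    p = p1                                     # distribution over states at time t
    for t in range(T):
        # Loss plus expected cost-to-go of each state-action pair at time t.
        q = ell[t] + (M[t].T @ nu[t + 1] if t < T - 1 else 0.0)
        best = q.reshape(m, n).argmin(axis=1)  # a minimizing action for each state
        xt = np.zeros(m * n)
        for i in range(m):
            xt[i * n + best[i]] = p[i]         # put all of state i's mass on it
        x.append(xt)
        if t < T - 1:
            p = M[t] @ xt                      # state distribution at time t + 1
    return x
```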

General thoughts

It is also possible to take some limits (under certain assumptions on the operators, etc.) and recover the “true” HJB equation in the continuous limit, similar to the one that Matt posted. This is a little more annoying than the “standard” approach, but can be made fully rigorous, though I won’t do so here.

Overall, this is probably a slightly more complicated way to derive the construction than the standard Bellman approach. Of course, once you recognize that a value function of the form above can be constructed, the rest is just mechanical. What’s interesting about this approach, though, is that it is simply duality + basic observations.


Footnotes

  1. In our setting, the number of states of the system can depend on the time $t$ and the actions that we are allowed to take can depend not just on the time but also on the actual state. For simplicity, we will ignore this, but the generalization is essentially immediate from the set up here; there are just more indices. It is still informative to note that this “generalization” can be included in the setting we’re considering by expanding the set of states and actions.

  2. Another observation is that there may be many optimal $x_t$ for the original problem. In particular, if there are many state-action pairs $j$ at which the minimum defining $(\nu_t)_i$ is attained for state $i$, then any probability distribution over these minimizing state-action pairs $j$ will suffice. (In other words, for all minimizing pairs $j$ for state $i$, the only constraint we need is that the sum of $(x_t)_j$ over the minimizing pairs $j$ of state $i$ is equal to $(M_{t-1} x_{t-1})_i$, or $(p_1)_i$ when $t = 1$.)