Hamilton–Jacobi–Bellman is just linear duality
This is, I suspect, a very well-known result, but it’s worth writing down anyway. (It is certainly a “folk” theorem in the sense that, e.g., Stephen Boyd has mentioned something similar in spirit in at least one lecture I’ve attended.) This blogpost was inspired by Matt Lorig’s latest note on the Hamilton–Jacobi–Bellman equation, which is a fun little read if you’re interested in the stochastic case!
Ok, let’s get to it.
Set up
In our scenario, we have some system evolving through time $t = 0, 1, \dots, T$. At each time $t$, the system is in one of $n$ states, and, after observing this state, we can then take one of $m$ possible actions.1 In other words, there are $nm$ possible state-action pairs. We will say a state-action pair is executed if the system is in the state represented by the state-action pair and we perform the action represented by the same pair.
It will be very convenient to define the “sum-over-actions” matrix $S \in \{0, 1\}^{n \times nm}$: for each state $s$ and each state-action pair $(s', a)$,
$$
S_{s, (s', a)} = \begin{cases} 1 & s = s'\\ 0 & \text{otherwise.}\end{cases}
$$
So, if, for example, we have a vector $x$ over the state-action pairs, then $Sx$ is an $n$-vector whose $s$th entry denotes the sum over all actions for that particular state $s$. In other words, if we think of $x$ as a probability distribution over state-action pairs, the matrix $S$ marginalizes out the actions, leaving the (marginal) distribution over states.
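To make this concrete, here is a minimal numpy sketch of $S$ (the particular values of $n$ and $m$ and the state-major ordering of the pairs are my own choices for illustration, not anything required by the setup):

```python
import numpy as np

n, m = 3, 2  # hypothetical sizes: 3 states, 2 actions per state
# Order the state-action pairs state-major: pair (s, a) gets flat index s*m + a.
# Then S is the n x (n*m) matrix whose row s has ones exactly at the pairs with state s.
S = np.kron(np.eye(n), np.ones(m))

x = np.random.rand(n * m)
x /= x.sum()          # a distribution over the state-action pairs

print(S @ x)          # the marginal distribution over the 3 states
print((S @ x).sum())  # 1.0, since x was normalized
```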
For each time $t$, we have an associated transition matrix $P_t \in \mathbf{R}^{n \times nm}$. This matrix maps a state-action pair (one of the $nm$ possible ones) to a probability distribution over the $n$ states of the system. In particular, if the system is in some state and takes some action, such that the state-action pair is represented by the index $i$ with $1 \le i \le nm$, then the $i$th column of $P_t$ denotes the resulting probability distribution over states of the system after the state-action pair $i$ has been executed. (As an example, if the system has deterministic dynamics, then each column of $P_t$ is a single unit basis vector. The one nonzero entry of this column denotes the state that the system lands in after executing the state-action pair $i$.)
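As a small illustration of the deterministic case, here is a sketch with made-up dynamics (the specific “stay or move right” actions are just for the example):

```python
import numpy as np

n, m = 3, 2
# P is n x (n*m): column s*m + a is the distribution over next states after
# executing the pair (s, a).  These made-up dynamics are deterministic:
# action 0 stays put, action 1 moves to the next state (wrapping around),
# so every column is a unit basis vector.
P = np.zeros((n, n * m))
for s in range(n):
    P[s, s * m + 0] = 1.0            # (s, action 0) -> stay in s
    P[(s + 1) % n, s * m + 1] = 1.0  # (s, action 1) -> move to s + 1

assert np.allclose(P.sum(axis=0), 1.0)  # every column is a probability distribution
```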
Finally, at each time step $t$, we will assign a loss vector $\ell_t \in \mathbf{R}^{nm}$. The $i$th entry of this vector, $(\ell_t)_i$, denotes the loss we eat from executing the state-action pair $i$ at time $t$.
The natural question is: ok, we have this system that we can control, how do we choose our actions in order to minimize the total loss across all time steps? This is, of course, a control problem and we specify it next.
Control problem
From before, we would like to control the system such that the total loss is minimized, so let’s write that all as a simple optimization problem.
Assuming the system’s initial state is picked from some distribution $p \in \mathbf{R}^n$, this is just a linear programming problem:
$$
\begin{aligned}
&\text{minimize} && \textstyle\sum_{t=0}^T \ell_t^T x_t\\
&\text{subject to} && Sx_{t+1} = P_t x_t, \quad t = 0, \dots, T-1\\
&&& Sx_0 = p\\
&&& x_t \ge 0, \quad t = 0, \dots, T.
\end{aligned}
$$
Here, the variables are the distributions $x_t \in \mathbf{R}^{nm}$ over the state-action pairs for $t = 0, \dots, T$. This problem is essentially writing down everything we just mentioned in the previous paragraphs, but let’s go through it step by step.
First, the objective. At each time step $t$, the probability that we execute state-action pair $i$ is given by $(x_t)_i$. The loss incurred from doing that is $(\ell_t)_i$, so take the inner product $\ell_t^T x_t$ and sum that over all possible times to get the total expected loss.
Let’s start with the last two constraints. Obviously, since $x_t$ is a probability distribution, it should be nonnegative, which is just being enforced here for all times. On the other hand, we know that the initial state of the system (no matter what actions we take) will have probability distribution $p$, by assumption. So, we must have that $Sx_0 = p$, i.e., that the marginal of $x_0$ over the actions is $p$, as required.
Finally, the first constraint is simply encoding the dynamics we talked about in the previous section. The matrix $P_t$ maps the distribution over state-action pairs at time $t$ to the distribution over states at time $t+1$. But the distribution over states at time $t+1$ is, by definition, the sum over all possible actions, for each state, which is simply $Sx_{t+1}$. We enforce this for each time $t = 0$ all the way to $t = T-1$ (since the final state-action distribution is $x_T$).
We can solve this problem directly, of course, using just a standard linear programming solver and be done with life. But, like many things, the dual problem often has a lovely interpretation. (And, in some important cases, as in this one, it will be much easier to solve!)
As a side comment, note that we do not need the constraint that $\mathbf{1}^T x_0 = 1$ (i.e., that $x_0$ is normalized) since this is implied by the fact that $Sx_0 = p$, which means that its marginal is normalized by definition. (The rest of the time steps are normalized by definition of $P_t$, whose columns each sum to one.)
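If you want to actually solve the primal numerically, here is a minimal sketch using cvxpy with randomly generated problem data (the data construction is purely illustrative and is not from the post):

```python
import numpy as np
import cvxpy as cp

T, n, m = 5, 3, 2                   # hypothetical horizon and sizes
S = np.kron(np.eye(n), np.ones(m))  # sum-over-actions matrix
# Random column-stochastic n x (n*m) transition matrices and random losses.
P = [np.random.dirichlet(np.ones(n), size=n * m).T for _ in range(T)]
ell = [np.random.rand(n * m) for _ in range(T + 1)]
p = np.ones(n) / n                  # uniform initial state distribution

x = [cp.Variable(n * m, nonneg=True) for _ in range(T + 1)]
constraints = [S @ x[0] == p]
constraints += [S @ x[t + 1] == P[t] @ x[t] for t in range(T)]
objective = cp.Minimize(sum(ell[t] @ x[t] for t in range(T + 1)))
problem = cp.Problem(objective, constraints)
problem.solve()
print(problem.value)                # optimal expected total loss
```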
Dual problem
To derive the dual problem, let’s introduce the following dual variables: $v_0 \in \mathbf{R}^n$ will correspond to the second-to-last constraint, $Sx_0 = p$, while $v_{t+1} \in \mathbf{R}^n$ will correspond to the $t$th dynamics constraint $Sx_{t+1} = P_t x_t$. (The constraint that $x_t \ge 0$ will simply relax some equalities to inequalities, so I won’t include it here, but you can introduce additional multipliers for this, if it helps the bookkeeping.)
The resulting Lagrangian is, for $x_t \ge 0$,
$$
L(x, v) = \sum_{t=0}^T \ell_t^T x_t + v_0^T(p - Sx_0) + \sum_{t=0}^{T-1} v_{t+1}^T\left(P_t x_t - Sx_{t+1}\right).
$$
Rearranging gives
$$
L(x, v) = p^T v_0 + \sum_{t=0}^{T-1}\left(\ell_t - S^T v_t + P_t^T v_{t+1}\right)^T x_t + \left(\ell_T - S^T v_T\right)^T x_T.
$$
Finally, minimizing over $x_t \ge 0$ (the infimum is $-\infty$ unless each of the coefficient vectors multiplying the $x_t$ is nonnegative) gives the dual optimization problem
$$
\begin{aligned}
&\text{maximize} && p^T v_0\\
&\text{subject to} && S^T v_T \le \ell_T\\
&&& S^T v_t \le \ell_t + P_t^T v_{t+1}, \quad t = 0, \dots, T-1,
\end{aligned}
$$
with variables $v_t \in \mathbf{R}^n$ for $t = 0, \dots, T$.
If you’re familiar with the HJB equation, well, the second constraint is pretty much it. If not, don’t worry, we’ll go through it in a second.
Analyzing the dual
As a matter of tradition, the vector $v_t$ is called the value function or cost-to-go at time $t$. (We will refer to it as the cost-to-go here.) This is a great observation from Bellman, who discovered that the original problem we had satisfies a dynamic programming principle, which, in turn, enables this value function to be easily constructed and later used to solve the original problem. (Though, of course, we discovered it here purely mechanically from just taking the dual problem.)
As a side note, the matrix $S^T$ has a nice interpretation we will use throughout. For each state $s$ and state-action pair $i$, the value $(S^T)_{is} = S_{si}$ is one if state-action pair $i$ is over state $s$ and zero otherwise. In other words, the $s$th row of $S$ is the Boolean vector indicating all state-action pairs corresponding to state $s$, so $(S^T v)_i = v_s$ whenever pair $i$ has state $s$.
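Concretely, $S^T$ just “broadcasts” a vector over states out to the state-action pairs. A tiny sketch, reusing the $S$ from the earlier snippet:

```python
import numpy as np

n, m = 3, 2
S = np.kron(np.eye(n), np.ones(m))
v = np.array([10.0, 20.0, 30.0])  # some cost-to-go vector over the 3 states
print(S.T @ v)                    # [10. 10. 20. 20. 30. 30.]
```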
Ok, now onto the problem. First, note that, since $p$ is nonnegative, the objective $p^T v_0$ is nondecreasing in all of the variables (it depends only on $v_0$). Additionally, since $P_t$ is also nonnegative, the right-hand side of the second constraint, $\ell_t + P_t^T v_{t+1}$, is nondecreasing in $v_{t+1}$, as is the left-hand side, $S^T v_t$, in $v_t$, since $S$ is also nonnegative. This means that, so long as we continue to satisfy the constraints of the problem, we can increase $v_t$ as much as we want and it will not decrease the objective value, for any $t$. (In general, it might even increase it, which would be great.)
Let’s start from $v_T$, the cost-to-go at the last time step.
What’s the maximum we can increase $v_T$? Well, we can increase $(v_T)_s$ as much as possible, until it is equal to the lowest loss among state-action pairs containing state $s$. Equivalently: over all possible actions we can take in state $s$, we can make $(v_T)_s$ equal to the smallest loss. Writing it out:
$$
(v_T)_s = \min_a\; (\ell_T)_{(s,a)}.
$$
This is true for all states $s$, and this is the largest we can possibly hope to make $v_T$.
Now, consider $v_t$ for $t < T$. Assume we have $v_{t+1}$ (since we have $v_T$, this is fine); then, using the same reasoning as before, we can take
$$
(v_t)_s = \min_a\; \left((\ell_t)_{(s,a)} + \left(P_t^T v_{t+1}\right)_{(s,a)}\right).
$$
The first term of the sum is the same as the previous one, except over time $t$. On the other hand, $P_t^T v_{t+1}$ can be interpreted as follows. For each state-action pair $i$, the value $(P_t^T v_{t+1})_i$ is the expected cost-to-go if we were to execute $i$. In particular, if we execute $i$, then the $i$th row of $P_t^T$ (equivalently, the $i$th column of $P_t$) tells us the probability of landing in each of the $n$ possible states of the system, and $(P_t^T v_{t+1})_i$ is exactly the expectation of $v_{t+1}$, the cost-to-go at the next time period, with respect to that distribution.
Finally, since the objective is nondecreasing and we have made $v_t$ as large as possible for all $t$, it must be the case that $p^T v_0$ is maximized, so this choice of $v_0, \dots, v_T$ is optimal and we are finished.
Note that this equation is exactly the stochastic Bellman equation (which I’m cheating and calling the HJB equation) as required!
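In code, this backward recursion is only a few lines. Here is a sketch, assuming the same data layout as in the earlier snippets, where pairs are ordered state-major so that reshaping to $(n, m)$ groups each state’s actions together:

```python
import numpy as np

def cost_to_go(ell, P, n, m):
    """Backward (Bellman) recursion for the cost-to-go vectors v[0], ..., v[T]."""
    T = len(ell) - 1
    v = [None] * (T + 1)
    # (v_T)_s = min over actions a of (ell_T)_{(s, a)}
    v[T] = ell[T].reshape(n, m).min(axis=1)
    for t in range(T - 1, -1, -1):
        # q_{(s, a)} = (ell_t)_{(s, a)} + expected cost-to-go after executing (s, a)
        q = ell[t] + P[t].T @ v[t + 1]
        v[t] = q.reshape(n, m).min(axis=1)
    return v
```

With the data from the cvxpy sketch above, `p @ cost_to_go(ell, P, n, m)[0]` should match `problem.value` up to solver tolerance.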
Recovering the solution
The final question is: ok, we have the value functions $v_0, \dots, v_T$. How do we use these to solve for the original $x_0, \dots, x_T$?
Here’s a simple construction, which I will leave to the reader to prove is an optimal point. Let’s assume, for now, that we have the marginal distribution over states at time $t$, which we’ll call $\pi_t$ (at $t = 0$ this is just $\pi_0 = p$). Let $(s, a^\star)$ be any state-action pair containing state $s$ for which
$$
(\ell_t)_{(s, a^\star)} + \left(P_t^T v_{t+1}\right)_{(s, a^\star)} = (v_t)_s,
$$
where the second term on the left is dropped when $t = T$.
(That is, the state-action pair is a minimizer for state $s$ at time $t$.) Set $(x_t)_{(s, a^\star)} = (\pi_t)_s$ and set the entries of $x_t$ for all other state-action pairs which contain state $s$ to zero. Do the same for all possible states $s$, propagate the state distribution forward via $\pi_{t+1} = P_t x_t$, and so on for all $t$; this will result in a feasible set of $x_0, \dots, x_T$. Can you show this is optimal? (The simplest way is essentially the Bellman approach. The second simplest is via complementary slackness.)2
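Here is one way to write this recovery as code: a sketch under the same assumptions as the earlier snippets, where ties between minimizing actions are broken arbitrarily by argmin (which is fine, per the second footnote), and `pi` is the running state distribution:

```python
import numpy as np

def recover_solution(ell, P, v, p, n, m):
    """Greedily build the distributions x[0], ..., x[T] from the cost-to-go v."""
    T = len(ell) - 1
    x, pi = [], p.copy()                       # pi is the state distribution at time t
    for t in range(T + 1):
        q = ell[t] if t == T else ell[t] + P[t].T @ v[t + 1]
        best = q.reshape(n, m).argmin(axis=1)  # a minimizing action for each state
        x_t = np.zeros(n * m)
        x_t[np.arange(n) * m + best] = pi      # put all of pi(s) on the chosen pair
        x.append(x_t)
        if t < T:
            pi = P[t] @ x_t                    # propagate the state distribution forward
    return x
```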
General thoughts
It is also possible to take some limits (under certain assumptions over the operators, etc) and recover the “true” HJB equation in the continuous limit, similar to the one that Matt posted. This is a little more annoying than the “standard” approach, but can be made fully rigorous, though I won’t do so here.
Overall, this is probably a slightly more complicated way to derive the construction than the standard Bellman approach. Of course, once you recognize that a value function of the form above can be constructed, the rest is just mechanical. What’s interesting about this approach, though, is that it is simply duality + basic observations.
Footnotes
- In our setting, the number of states of the system can depend on the time, and the actions that we are allowed to take can depend not just on the time but also on the actual state. For simplicity, we will ignore this, but the generalization is essentially immediate from the setup here; there are just more indices. It is still informative to note that this “generalization” can be included in the setting we’re considering by expanding the set of states and actions. ↩
- Another observation is that there may be many optimal $x_0, \dots, x_T$ for the original problem. In particular, if there are many state-action pairs that achieve the minimum at state $s$, then any probability distribution over these minimizing state-action pairs will suffice. (In other words, for all minimizing pairs for state $s$, the only constraint we need is that the sum of the entries of $x_t$ over the minimizing pairs of state $s$ is equal to $(\pi_t)_s$.) ↩