Longest Path Search

Hamilton–Jacobi–Bellman is just linear duality


This is, I suspect, a very well known result, but it’s worth writing down anyway. (It is certainly a “folk” theorem in the sense that, e.g., Stephen Boyd has mentioned something similar in spirit in at least one lecture I’ve attended.) This blogpost was inspired by Matt Lorig’s latest note on the Hamilton–Jacobi–Bellman equation, which is a fun little read if you’re interested in the stochastic case!

Ok, let’s get to it.

Set up

In our scenario, we have some system evolving through time $t = 1, \dots, T$. At each time $t$, the system is in one of $m$ states, and, after observing this state, we can then take one of $n$ possible actions.1 In other words, there are $mn$ possible state-action pairs. We will say a state-action pair is executed if the system is in the state represented by the state-action pair and we perform the action represented by the same pair.

It will be very convenient to define the “sum-over-actions” matrix $S \in \mathbf{R}^{m \times mn}$ as, for $i = 1, \dots, m$ and $j = 1, \dots, mn$,

$$
S_{ij} = \begin{cases} 1 & \text{state } i \text{ is part of state-action pair } j \\ 0 & \text{otherwise.} \end{cases}
$$

So, if, for example, we have a vector $y \in \mathbf{R}^{mn}$ over the $mn$ state-action pairs, then $Sy$ is an $m$-vector whose $i$th entry denotes the sum over all actions for that particular state $i$. In other words, if we think of $y$ as a probability distribution over state-action pairs, the matrix $S$ marginalizes out the actions, leaving the (marginal) distribution over states.
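To make this concrete, here is a tiny numpy sketch. The indexing of pairs as $j = i\,n + a$ (state $i$, action $a$, zero-indexed) is my own choice for concreteness; any fixed ordering of the pairs works the same way.

```python
import numpy as np

m, n = 3, 2          # states, actions (small, arbitrary example)

# Sum-over-actions matrix S: S[i, j] = 1 if state-action pair j involves state i.
# Here we assume pair j = i * n + a corresponds to (state i, action a).
S = np.zeros((m, m * n))
for i in range(m):
    for a in range(n):
        S[i, i * n + a] = 1.0

# A distribution y over state-action pairs...
y = np.random.rand(m * n)
y /= y.sum()

# ...marginalized over the actions gives a distribution over states.
print(S @ y)             # m-vector of state probabilities
print((S @ y).sum())     # 1.0 (up to floating point)
```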

For each time $t = 1, \dots, T-1$, we have an associated transition matrix $M_t \in \mathbf{R}^{m \times mn}$. This matrix will map a state-action pair (one of the $mn$ possible ones) to a probability distribution over the $m$ states of the system. In particular, if the system is in some state and takes some action, such that the state-action pair is represented by the index $j$ with $1 \le j \le mn$, then the $j$th column of $M_t$ denotes the resulting probability distribution over states of the system after the state-action pair $j$ has been executed. (As an example, if the system has deterministic dynamics, then each column of $M_t$ is a single unit basis vector. The one nonzero entry of this column denotes the state that the system lands in after executing the state-action pair $j$.)
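As a quick sketch of the deterministic case just mentioned (using the same $j = i\,n + a$ indexing as above; the transition table itself is made up purely for illustration):

```python
import numpy as np

m, n = 3, 2
# next_state[i, a]: the state reached after taking action a in state i.
next_state = np.array([[1, 2],
                       [0, 2],
                       [0, 1]])

# Deterministic dynamics: column j = i * n + a of M_t is the unit basis
# vector of the landing state. (Stochastic dynamics would put a full
# probability distribution in each column instead.)
M_t = np.zeros((m, m * n))
for i in range(m):
    for a in range(n):
        M_t[next_state[i, a], i * n + a] = 1.0

print(M_t.sum(axis=0))   # each column sums to 1, i.e., is a distribution
```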

Finally, at each time step $t = 1, \dots, T$, we will assign a loss $\ell_t \in \mathbf{R}^{mn}$ at time $t$. The $j$th entry of this vector, $(\ell_t)_j$, denotes the loss we eat from executing the state-action pair $j = 1, \dots, mn$.

The natural question is: ok, we have this system that we can control, how do we choose our actions in order to minimize the total loss across all time steps? This is, of course, a control problem and we specify it next.

Control problem

From before, we would like to control the system such that the total loss is minimized, so let’s write that all as a simple optimization problem.

Assuming the system’s initial state is picked from some distribution $p_1 \in \mathbf{R}^m$, this is just a linear programming problem:

$$
\begin{array}{ll}
\text{minimize} & \sum_{t=1}^T \ell_t^T x_t \\
\text{subject to} & S x_{t+1} = M_t x_t, \quad t = 1, \dots, T-1 \\
& S x_1 = p_1 \\
& x_t \ge 0, \quad t = 1, \dots, T.
\end{array}
$$

Here, the variables are the distributions $x_t \in \mathbf{R}^{mn}$ over the state-action pairs for $t = 1, \dots, T$. This problem is essentially writing down everything we just mentioned in the previous paragraphs, but let’s go through it step by step.

First, the objective. At each time step $t = 1, \dots, T$, the probability that we execute state-action pair $j$ is given by $(x_t)_j$. The loss incurred from doing that is $(\ell_t)_j$, so take the inner product and sum over all times $t = 1, \dots, T$ to get the expected total loss.

Let’s start with the last two constraints. Obviously, since $x_t$ is a probability distribution, it should be nonnegative, which is just being enforced here for all times. On the other hand, we know that the initial state of the system (no matter what actions we take) will have probability distribution $p_1$, by assumption. So, we must have that $S x_1 = p_1$, as required.

Finally, the first constraint is simply encoding the dynamics we talked about in the previous section. The matrix $M_t$ maps the state-action pairs at time $t$ to the distribution over states at time $t+1$. But the distribution over states at time $t+1$ is, by definition, the sum over all possible actions, for each state, which is simply $S x_{t+1}$. We enforce this for each time $t = 1$ all the way to $t = T-1$ (since the final distribution is $x_T$).

We can solve this problem directly, of course, using just a standard linear programming solver and be done with life. But, like many things, the dual problem often has a lovely interpretation. (And, in some important cases, as in this one, it will be much easier to solve!)

As a side comment, note that we do not need the constraint that $\mathbf{1}^T x_t = 1$ (i.e., that $x_t$ is normalized) since this is implied by the fact that $S x_1 = p_1$: each column of $S$ sums to one, so $\mathbf{1}^T x_1 = \mathbf{1}^T S x_1 = \mathbf{1}^T p_1 = 1$. (The rest of the time steps $t = 2, \dots, T$ are then normalized by the definition of $M_t$, whose columns are themselves distributions.)
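Since the problem really is just a linear program, here is a minimal sketch of setting it up and solving it numerically, say with cvxpy. All of the data below is randomly generated purely for illustration; in practice you would plug in your own $S$, $M_t$, $\ell_t$, and $p_1$.

```python
import numpy as np
import cvxpy as cp

m, n, T = 3, 2, 5                      # states, actions, horizon (arbitrary)
rng = np.random.default_rng(0)

# Sum-over-actions matrix, assuming pair j = i * n + a as before.
S = np.kron(np.eye(m), np.ones((1, n)))

# Random column-stochastic transition matrices and random losses.
M = [rng.random((m, m * n)) for _ in range(T - 1)]
M = [Mt / Mt.sum(axis=0, keepdims=True) for Mt in M]
ell = [rng.random(m * n) for _ in range(T)]
p1 = rng.random(m); p1 /= p1.sum()

# The linear program from above.
x = [cp.Variable(m * n, nonneg=True) for _ in range(T)]
constraints = [S @ x[0] == p1]
constraints += [S @ x[t + 1] == M[t] @ x[t] for t in range(T - 1)]
objective = cp.Minimize(sum(ell[t] @ x[t] for t in range(T)))

problem = cp.Problem(objective, constraints)
problem.solve()
print(problem.value)                   # optimal expected total loss
```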

Dual problem

To derive the dual problem, let’s introduce the following dual variables: $\nu_1$ will correspond to the second-to-last constraint, $S x_1 = p_1$, while $\nu_{t+1}$ will correspond to the $t$th dynamics constraint $S x_{t+1} = M_t x_t$. (The constraint that $x_t \ge 0$ will simply relax some equalities to inequalities, so I won’t include it here, but you can introduce additional multipliers for it, if that helps the bookkeeping.)

The resulting Lagrangian is, for $x_t \ge 0$,

$$
L(x, \nu) = \sum_{t=1}^T \ell_t^T x_t + \sum_{t=1}^{T-1} \nu_{t+1}^T (M_t x_t - S x_{t+1}) + \nu_1^T (p_1 - S x_1).
$$

Rearranging gives

$$
L(x, \nu) = \sum_{t=1}^{T-1} (\ell_t + M_t^T \nu_{t+1} - S^T \nu_t)^T x_t + (\ell_T - S^T \nu_T)^T x_T + p_1^T \nu_1.
$$
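To get from here to the dual, the only fact we need is the standard one: for any vector $c$,

$$
\inf_{x \ge 0} c^T x = \begin{cases} 0 & c \ge 0 \\ -\infty & \text{otherwise}, \end{cases}
$$

so the minimum of the Lagrangian over $x \ge 0$ is finite (and equal to $p_1^T \nu_1$) exactly when every coefficient vector multiplying an $x_t$ above is nonnegative.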

Finally, minimizing over $x_t \ge 0$ gives the dual optimization problem

$$
\begin{array}{ll}
\text{maximize} & p_1^T \nu_1 \\
\text{subject to} & S^T \nu_t \le \ell_t + M_t^T \nu_{t+1}, \quad t = 1, \dots, T-1 \\
& S^T \nu_T \le \ell_T.
\end{array}
$$

If you’re familiar with the HJB equation, well, the first constraint is pretty much it. If not, don’t worry, we’ll go through it in a second.
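If you want to see the duality numerically, here is a sketch of the dual in cvxpy, using the same randomly generated data as in the primal sketch above (regenerated here so the block stands alone). The two optimal values should agree, since strong duality holds for feasible, bounded LPs.

```python
import numpy as np
import cvxpy as cp

# Same illustrative data as in the primal sketch above.
m, n, T = 3, 2, 5
rng = np.random.default_rng(0)
S = np.kron(np.eye(m), np.ones((1, n)))
M = [rng.random((m, m * n)) for _ in range(T - 1)]
M = [Mt / Mt.sum(axis=0, keepdims=True) for Mt in M]
ell = [rng.random(m * n) for _ in range(T)]
p1 = rng.random(m); p1 /= p1.sum()

# The dual problem: maximize p1' nu_1 subject to the constraints above.
nu = [cp.Variable(m) for _ in range(T)]
constraints = [S.T @ nu[t] <= ell[t] + M[t].T @ nu[t + 1] for t in range(T - 1)]
constraints += [S.T @ nu[T - 1] <= ell[T - 1]]
dual = cp.Problem(cp.Maximize(p1 @ nu[0]), constraints)
dual.solve()
print(dual.value)   # should match the primal optimal value
```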

Analyzing the dual

As a matter of tradition, the vector $\nu_t$ is called the value function or cost-to-go at time $t$. (We will refer to it as the cost-to-go here.) This is a great observation of Bellman’s, who discovered that the original problem we had satisfies a dynamic programming principle, which, in turn, enables this value function to be easily constructed and later used to solve the original problem. (Though, of course, we discovered it here purely mechanically by just taking the dual problem.)

As a side note, the matrix $S^T$ has a nice interpretation we will use throughout. For each state $i$, the value $(S^T)_{ji}$ is one if state-action pair $j$ contains state $i$ and zero otherwise. In other words, the $i$th column of $S^T$ (equivalently, the $i$th row of $S$) is the Boolean vector indicating which state-action pairs correspond to state $i$.

Ok, now onto the problem. First, note that, since $p_1 \ge 0$ is nonnegative, the objective is nondecreasing in all of the variables. Additionally, since $M_t \ge 0$ is also nonnegative, the right hand side of each constraint is nondecreasing in $\nu_{t+1}$, and the left hand side is nondecreasing in $\nu_t$, since $S$ is also nonnegative. This means that, so long as we continue to satisfy the constraints of the problem, we can increase $\nu_t$ as much as we want and it will not decrease the objective value, for any $t$. (In general, it might even increase it, which would be great.)

Let’s start from $\nu_T$, the value function at the last time step.

What’s the maximum we can increase $\nu_T$? Well, we can increase $(\nu_T)_i$ as much as possible, until $(\nu_T)_i$ is equal to the lowest loss over the state-action pairs $j$ containing state $i$. Equivalently: over all possible actions we can take in state $i$, we can make $(\nu_T)_i$ equal to the smallest resulting loss. Writing it out:

$$
(\nu_T)_i = \min_{j \,\text{contains}\, i} \; (\ell_T)_j.
$$

This is true for all states $i = 1, \dots, m$, and this is the largest we can possibly hope to make $\nu_T$.

Now, consider $\nu_t$ for $t = 1, \dots, T-1$. Assume we have $\nu_{t+1}$ (since we have $\nu_T$, this is fine); then, using the same reasoning as before, we can take

$$
(\nu_t)_i = \min_{j \,\text{contains}\, i} \; (\ell_t + M_t^T \nu_{t+1})_j.
$$

The first term of the sum is the same as in the terminal case, except at time $t$. On the other hand, $M_t^T \nu_{t+1}$ can be interpreted as follows. For each state-action pair $j$, the value $(M_t^T \nu_{t+1})_j$ is the expected cost-to-go if we were to execute $j$. In particular, if we execute $j$, then the $j$th row of $M_t^T$ (equivalently, the $j$th column of $M_t$) tells us the probability of landing in each of the possible $m$ states of the system, and $(M_t^T \nu_{t+1})_j$ is exactly the expectation of $\nu_{t+1}$, the cost-to-go at the next time period, with respect to that distribution.

Finally, since the objective is nondecreasing and we have made $\nu_t$ as large as possible for all $t$, it must be the case that $p_1^T \nu_1$ is maximized, so this choice of $\nu_t$ is optimal and we are finished.

Note that this equation is exactly the stochastic Bellman equation (which I’m cheating and calling the HJB equation) as required!
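In code, this backward recursion is only a few lines. Here is a minimal sketch, again assuming the $j = i\,n + a$ indexing of state-action pairs from the earlier snippets (so the minimum over pairs containing a state is just a reshape followed by a minimum over the action axis); the helper name `cost_to_go` is my own, not anything standard.

```python
import numpy as np

def cost_to_go(ell, M, m, n):
    """Backward (Bellman) recursion for the dual variables nu_1, ..., nu_T.

    ell: list of T loss vectors, each of length m * n (pair j = i * n + a).
    M:   list of T - 1 column-stochastic transition matrices, m by m * n.
    """
    T = len(ell)
    nu = [None] * T
    # Terminal condition: best loss over the actions available in each state.
    nu[T - 1] = ell[T - 1].reshape(m, n).min(axis=1)
    # Recursion: loss now plus expected cost-to-go, minimized over actions.
    for t in range(T - 2, -1, -1):
        q = ell[t] + M[t].T @ nu[t + 1]      # one value per state-action pair
        nu[t] = q.reshape(m, n).min(axis=1)  # min over pairs containing each state
    return nu
```

Running this on the data from the LP sketches above, `p1 @ cost_to_go(ell, M, m, n)[0]` should reproduce the optimal value reported by the solver.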

Recovering the solution

The final question is: ok, we have the value functions $\nu_t$. How do we use these to solve for the original $x_t$?

Here’s a simple one, which I will leave to the reader to prove is an optimal point. We construct the $x_t$ forwards in time, so let’s assume we have $x_{t-1}$ for now (at $t = 1$ we simply use the given $p_1$). Let $j$ be any state-action pair containing state $i$ for which

$$
(\nu_t)_i = (\ell_t + M_t^T \nu_{t+1})_j,
$$

where the right hand side is interpreted as just $(\ell_T)_j$ when $t = T$. (That is, the state-action pair $j$ is a minimizer for state $i$ at time $t$.) Set $(x_t)_j = (M_{t-1} x_{t-1})_i$ (or $(p_1)_i$ when $t = 1$) and set all other state-action pairs which contain state $i$ to zero. Do the same for all possible states $i$ and for every time $t = 1, \dots, T$; this results in a feasible set of $x_t$. Can you show this is optimal? (The simplest way is essentially the Bellman approach. The second simplest is via complementary slackness.)2
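Here is a sketch of that forward pass, under the same indexing assumption as before; `nu` is the list of cost-to-go vectors from the backward recursion (e.g., the output of the `cost_to_go` sketch above).

```python
import numpy as np

def recover_solution(nu, ell, M, p1, m, n):
    """Forward pass: recover an optimal x_1, ..., x_T from the cost-to-go nu.

    nu:  list of T state value vectors (e.g., from the backward recursion above).
    ell: list of T loss vectors over state-action pairs (pair j = i * n + a).
    M:   list of T - 1 transition matrices, m by m * n.
    p1:  initial distribution over states.
    """
    T = len(ell)
    x = []
    p = p1                                     # distribution over states at time t
    for t in range(T):
        # Loss plus expected cost-to-go of each state-action pair at time t.
        q = ell[t] + (M[t].T @ nu[t + 1] if t < T - 1 else 0.0)
        best = q.reshape(m, n).argmin(axis=1)  # a minimizing action for each state
        xt = np.zeros(m * n)
        for i in range(m):
            xt[i * n + best[i]] = p[i]         # put all of state i's mass on it
        x.append(xt)
        if t < T - 1:
            p = M[t] @ xt                      # state distribution at time t + 1
    return x
```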

General thoughts

It is also possible to take some limits (under certain assumptions on the operators, etc.) and recover the “true” HJB equation in the continuous limit, similar to the one that Matt posted. This is a little more annoying than the “standard” approach, but can be made fully rigorous, though I won’t do so here.

Overall, this is probably a slightly more complicated way to derive the construction than the standard Bellman approach. Of course, once you recognize that a value function of the form above can be constructed, the rest is just mechanical. What’s interesting about this approach, though, is that it is simply duality + basic observations.


Footnotes

  1. In our setting, the number of states of the system can depend on the time $t$ and the actions that we are allowed to take can depend not just on the time but also on the actual state. For simplicity, we will ignore this, but the generalization is essentially immediate from the set up here; there are just more indices. It is still informative to note that this “generalization” can be included in the setting we’re considering by expanding the set of states and actions.

  2. Another observation is that there may be many optimal $x_t$ for the original problem. In particular, if there are many state-action pairs $j$ at which the minimum defining $(\nu_t)_i$ is attained for state $i$, then any probability distribution over these minimizing state-action pairs $j$ will suffice. (In other words, for all minimizing pairs $j$ for state $i$, the only constraint we need is that the sum of $(x_t)_j$ over the minimizing pairs $j$ of state $i$ is equal to $(M_{t-1} x_{t-1})_i$, or $(p_1)_i$ when $t = 1$.)