
Notes on policy equivalence in adaptive Bayesian experimental design


Continuing from Bayesian experimental design and adaptive Bayesian experimental design, we can make precise why the nested adaptive criterion is equivalent to an optimal policy problem.

Consider three sequential design choices \(D_1,D_2,D_3\) and outcomes \(Y_1,Y_2,Y_3\). Let $$ u(d_1,d_2,d_3,y_1,y_2,y_3) = -S(\theta \mid y_{1:3},d_{1:3}), $$ where \(S\) is the entropy of the posterior over \(\theta\), so maximizing \(u\) is the same as minimizing posterior entropy.

The adaptive nested objective can be written as $$ V =\max_{d_1\in D_1} \mathbb E_{Y_1\mid d_1}\Big[ \max_{d_2\in D_2} \mathbb E_{Y_2\mid Y_1,d_1,d_2}\Big[ \max_{d_3\in D_3} \mathbb E_{Y_3\mid Y_1,Y_2,d_1,d_2,d_3} \big[u(d_1,d_2,d_3,Y_1,Y_2,Y_3)\big] \Big] \Big]. $$
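To make the nested objective concrete, here is a minimal sketch on a hypothetical discrete toy problem: binary \(\theta\) with an invented prior of 0.6, binary designs and outcomes, and a made-up likelihood table. Because the utility depends on the history only through the posterior, the recursion can carry the belief \(p(\theta=1)\) as its state; all numbers are illustrative assumptions, not from the text.

```python
import math

PRIOR = 0.6                                   # invented prior p(theta = 1)
P1 = {(1, 0): 0.8, (0, 0): 0.3, (1, 1): 0.6, (0, 1): 0.1}  # made-up p(y=1 | theta, d)

def lik(y, th, d):
    """Bernoulli likelihood of outcome y under design d when theta = th."""
    return P1[th, d] if y == 1 else 1.0 - P1[th, d]

def update(p, d, y):
    """Posterior belief p(theta = 1) after observing y under design d."""
    a, b = p * lik(y, 1, d), (1.0 - p) * lik(y, 0, d)
    return a / (a + b)

def pred(p, d, y):
    """Predictive probability of outcome y under design d at belief p."""
    return p * lik(y, 1, d) + (1.0 - p) * lik(y, 0, d)

def neg_entropy(p):
    """Utility u = -S(posterior): negative entropy of the final belief."""
    return sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

def V(p, stage, horizon=3):
    """Nested objective: alternate max over designs and expectation over outcomes."""
    if stage == horizon:
        return neg_entropy(p)
    return max(sum(pred(p, d, y) * V(update(p, d, y), stage + 1, horizon) for y in (0, 1))
               for d in (0, 1))

print(V(PRIOR, 0))   # optimal 3-stage adaptive expected utility
```

The recursion alternates `max` over the two designs with an expectation over the two outcomes, exactly mirroring the three nested layers of the objective.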

This is not a myopic rule. The stage-1 choice is optimized for expected utility over all future observation paths, while explicitly anticipating that stages 2 and 3 will adapt optimally to whatever data are observed. Equivalently, we are solving for a policy that maps observations to decisions.

A deterministic non-anticipative policy is $$ \pi=(\pi_1,\pi_2,\pi_3) $$ with $$ \pi_1\in D_1,\qquad \pi_2:D_1\times \mathcal Y_1\to D_2,\qquad \pi_3:D_1\times \mathcal Y_1\times D_2\times \mathcal Y_2\to D_3. $$

Given \(\pi\), the policy value is $$ J(\pi)= \mathbb E\Big[ u\big( \pi_1, \pi_2(\pi_1,Y_1), \pi_3(\pi_1,Y_1,\pi_2(\pi_1,Y_1),Y_2), Y_1,Y_2,Y_3 \big) \Big]. $$ Let \(\Pi\) be the set of admissible policies. The key equivalence is $$ V = \max_{\pi\in\Pi} J(\pi). $$

The proof has two parts.

First, define the continuation values, which both parts of the proof will use: $$ G_3(d_1,y_1,d_2,y_2)= \max_{a_3\in D_3} \mathbb E\big[u(d_1,d_2,a_3,y_1,y_2,Y_3)\mid d_1,y_1,d_2,y_2,a_3\big], $$ $$ G_2(d_1,y_1)= \max_{a_2\in D_2} \mathbb E\big[G_3(d_1,y_1,a_2,Y_2)\mid d_1,y_1,a_2\big], $$ $$ G_1=\max_{a_1\in D_1}\mathbb E\big[G_2(a_1,Y_1)\mid a_1\big]. $$ By definition of the nested objective, \(G_1=V\).
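On a hypothetical discrete toy (binary \(\theta\) with an invented prior of 0.6, binary designs and outcomes, a made-up likelihood table), these continuation values can be computed by explicit backward induction over histories; the prior, likelihoods, and utility here are illustrative assumptions.

```python
import math

PRIOR = 0.6                                   # invented prior p(theta = 1)
P1 = {(1, 0): 0.8, (0, 0): 0.3, (1, 1): 0.6, (0, 1): 0.1}  # made-up p(y=1 | theta, d)

def lik(y, th, d):
    return P1[th, d] if y == 1 else 1.0 - P1[th, d]

def update(p, d, y):
    a, b = p * lik(y, 1, d), (1.0 - p) * lik(y, 0, d)
    return a / (a + b)

def pred(p, d, y):
    return p * lik(y, 1, d) + (1.0 - p) * lik(y, 0, d)

def neg_entropy(p):
    return sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)  # u = -S(posterior)

def post(hist):
    """Belief p(theta = 1) after a history [(d1, y1), (d2, y2), ...]."""
    p = PRIOR
    for d, y in hist:
        p = update(p, d, y)
    return p

def G3(d1, y1, d2, y2):
    p = post([(d1, y1), (d2, y2)])
    return max(sum(pred(p, a3, y3) * neg_entropy(update(p, a3, y3)) for y3 in (0, 1))
               for a3 in (0, 1))

def G2(d1, y1):
    p = post([(d1, y1)])
    return max(sum(pred(p, a2, y2) * G3(d1, y1, a2, y2) for y2 in (0, 1))
               for a2 in (0, 1))

G1 = max(sum(pred(PRIOR, a1, y1) * G2(a1, y1) for y1 in (0, 1)) for a1 in (0, 1))
print(G1)   # equals the nested objective V
```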

Part 1: no policy can score higher than V

Fix any policy \(\pi\). For each realized history \(h_2=(d_1,y_1,d_2,y_2)\), define the stage-3 value of choosing action \(a_3\) as $$ q_3(a_3;h_2)=\mathbb E\big[u(d_1,d_2,a_3,y_1,y_2,Y_3)\mid h_2,a_3\big]. $$ The policy chooses \(a_3=\pi_3(h_2)\), so it gets \(q_3(\pi_3(h_2);h_2)\). Since this is one feasible action in \(D_3\), $$ q_3(\pi_3(h_2);h_2)\le \max_{a_3\in D_3}q_3(a_3;h_2)=G_3(d_1,y_1,d_2,y_2). $$

Now move one step earlier. For each \((d_1,y_1)\), define $$ q_2(a_2;d_1,y_1)=\mathbb E\big[G_3(d_1,y_1,a_2,Y_2)\mid d_1,y_1,a_2\big]. $$ Again, the policy uses one feasible action \(a_2=\pi_2(d_1,y_1)\), so $$ q_2(\pi_2(d_1,y_1);d_1,y_1)\le \max_{a_2\in D_2}q_2(a_2;d_1,y_1)=G_2(d_1,y_1). $$

At stage 1, define $$ q_1(a_1)=\mathbb E\big[G_2(a_1,Y_1)\mid a_1\big]. $$ Because \(\pi_1\) is one feasible action in \(D_1\), $$ q_1(\pi_1)\le \max_{a_1\in D_1}q_1(a_1)=G_1, $$ and from the definitions above, \(G_1=V\).

Putting these stage-wise inequalities together gives \(J(\pi)\le V\). Since \(\pi\) was arbitrary, $$ \max_{\pi\in\Pi}J(\pi)\le V. $$
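Part 1 can be sanity-checked numerically on a hypothetical discrete toy (binary \(\theta\) with an invented prior of 0.6, binary designs and outcomes, a made-up likelihood table): fix a deliberately naive policy that always picks the same design, evaluate its \(J\) exactly by enumerating outcome paths, and confirm it never beats \(V\). All constants are illustrative assumptions.

```python
import math

PRIOR = 0.6                                   # invented prior p(theta = 1)
P1 = {(1, 0): 0.8, (0, 0): 0.3, (1, 1): 0.6, (0, 1): 0.1}  # made-up p(y=1 | theta, d)

def lik(y, th, d):
    return P1[th, d] if y == 1 else 1.0 - P1[th, d]

def update(p, d, y):
    a, b = p * lik(y, 1, d), (1.0 - p) * lik(y, 0, d)
    return a / (a + b)

def pred(p, d, y):
    return p * lik(y, 1, d) + (1.0 - p) * lik(y, 0, d)

def neg_entropy(p):
    return sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)  # u = -S(posterior)

def V(p, stage):
    """Nested objective, with the belief as the state."""
    if stage == 3:
        return neg_entropy(p)
    return max(sum(pred(p, d, y) * V(update(p, d, y), stage + 1) for y in (0, 1))
               for d in (0, 1))

def J_constant(d):
    """Exact value of the naive policy that always chooses design d."""
    total = 0.0
    for y1 in (0, 1):
        p1, w1 = update(PRIOR, d, y1), pred(PRIOR, d, y1)
        for y2 in (0, 1):
            p2, w2 = update(p1, d, y2), w1 * pred(p1, d, y2)
            for y3 in (0, 1):
                total += w2 * pred(p2, d, y3) * neg_entropy(update(p2, d, y3))
    return total

print(J_constant(0), J_constant(1), V(PRIOR, 0))
```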

Part 2: there exists a policy with value exactly V

Using those same continuation-value definitions, construct \(\pi^\star\) by choosing argmax actions at every history. Then \(\pi^\star\) makes exactly the same choices as the nested objective definition of \(V\). Evaluating \(J(\pi^\star)\) and applying iterated expectation from stage 3 back to stage 1 yields $$ J(\pi^\star)=G_1=V, $$ so $$ V\le \max_{\pi\in\Pi}J(\pi). $$

Combining the two parts yields $$ V=\max_{\pi\in\Pi}J(\pi). $$
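On a small enough toy the full equivalence can even be verified by brute force. The sketch below uses a hypothetical binary model (invented prior 0.6 and likelihood table): it enumerates every behaviorally distinct deterministic policy \((d_1,\ f_2:y_1\mapsto d_2,\ f_3:(y_1,y_2)\mapsto d_3)\), which is \(2\cdot 2^2\cdot 2^4 = 128\) policies, evaluates each \(J(\pi)\) exactly, and compares against the nested \(V\).

```python
import math
from itertools import product

PRIOR = 0.6                                   # invented prior p(theta = 1)
P1 = {(1, 0): 0.8, (0, 0): 0.3, (1, 1): 0.6, (0, 1): 0.1}  # made-up p(y=1 | theta, d)

def lik(y, th, d):
    return P1[th, d] if y == 1 else 1.0 - P1[th, d]

def update(p, d, y):
    a, b = p * lik(y, 1, d), (1.0 - p) * lik(y, 0, d)
    return a / (a + b)

def pred(p, d, y):
    return p * lik(y, 1, d) + (1.0 - p) * lik(y, 0, d)

def neg_entropy(p):
    return sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)  # u = -S(posterior)

def V(p, stage):
    if stage == 3:
        return neg_entropy(p)
    return max(sum(pred(p, d, y) * V(update(p, d, y), stage + 1) for y in (0, 1))
               for d in (0, 1))

def J(d1, f2, f3):
    """Exact policy value: f2 maps y1 to d2, f3 maps (y1, y2) to d3."""
    total = 0.0
    for y1 in (0, 1):
        p1, w1 = update(PRIOR, d1, y1), pred(PRIOR, d1, y1)
        d2 = f2[y1]
        for y2 in (0, 1):
            p2, w2 = update(p1, d2, y2), w1 * pred(p1, d2, y2)
            d3 = f3[y1, y2]
            for y3 in (0, 1):
                total += w2 * pred(p2, d3, y3) * neg_entropy(update(p2, d3, y3))
    return total

values = [J(d1, {0: a, 1: b}, {(0, 0): c, (0, 1): e, (1, 0): f, (1, 1): g})
          for d1, a, b, c, e, f, g in product((0, 1), repeat=7)]
v = V(PRIOR, 0)
print(max(values), v)   # best policy attains V; no policy exceeds it
```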

For Bayesian experimental design this matters because it clarifies what is computed offline and what happens online. The optimization is over decision rules, not over a single static sequence of designs. Once an approximate optimal policy is available, online adaptation is cheap: new measurements are plugged into \(\pi_2\) and \(\pi_3\) without re-solving the full nested problem.

In practice, for continuous observations \(Y\), tabular dynamic programming is not available. A common approximation is to parameterize the policy with neural networks, for example \(d_2=\pi_{\phi,2}(d_1,y_1)\) and \(d_3=\pi_{\phi,3}(d_1,y_1,d_2,y_2)\), where \(\phi\) are network weights. Conceptually, the exact problem is optimization in function space (choose the best measurable mapping from histories to actions). After parameterization, that infinite-dimensional search is replaced by finite-dimensional nonlinear programming over \(\phi\). Then we simulate trajectories from the Bayesian model, estimate \(J(\pi_\phi)\) by Monte Carlo, and optimize \(\phi\) with policy gradient methods.
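As a hedged sketch of this pipeline, the following uses plain NumPy instead of a neural-network library: a logistic policy whose stage-\(k\) parameters map the current posterior belief (a sufficient statistic for the history, used here in place of the raw history) to a design probability, trained by a score-function (REINFORCE) gradient on an invented binary toy model. The prior 0.6, the likelihood table, the belief-as-feature choice, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

PRIOR = 0.6                                   # invented prior p(theta = 1)
P1 = {(1, 0): 0.8, (0, 0): 0.3, (1, 1): 0.6, (0, 1): 0.1}  # made-up p(y=1 | theta, d)

def update(p, d, y):
    """Posterior p(theta = 1) after observing y under design d."""
    l1 = P1[1, d] if y == 1 else 1.0 - P1[1, d]
    l0 = P1[0, d] if y == 1 else 1.0 - P1[0, d]
    return p * l1 / (p * l1 + (1.0 - p) * l0)

def neg_entropy(p):
    """Utility u = -S(posterior), clipped away from 0 and 1 for numerical safety."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return p * np.log(p) + (1.0 - p) * np.log(1.0 - p)

def rollout(phi):
    """Sample theta and one 3-stage trajectory under the stochastic policy pi_phi.

    Returns the terminal utility and the score-function term d/dphi log pi(path)."""
    theta = int(rng.random() < PRIOR)
    p = PRIOR
    score = np.zeros_like(phi)
    for k in range(3):
        z = phi[k, 0] + phi[k, 1] * p          # logit of choosing design 1
        pd = 1.0 / (1.0 + np.exp(-z))
        d = int(rng.random() < pd)
        score[k] += (d - pd) * np.array([1.0, p])   # grad of log-prob w.r.t. phi[k]
        y = int(rng.random() < P1[theta, d])
        p = update(p, d, y)
    return neg_entropy(p), score

phi = np.zeros((3, 2))                         # one (bias, belief-weight) pair per stage
for _ in range(200):                           # REINFORCE with a mean baseline
    samples = [rollout(phi) for _ in range(64)]
    u = np.array([s[0] for s in samples])
    baseline = u.mean()                        # variance reduction
    grad = sum((ui - baseline) * si for ui, si in samples) / len(samples)
    phi += 0.5 * grad

j_hat = float(np.mean([rollout(phi)[0] for _ in range(2000)]))
print(j_hat)                                   # Monte Carlo estimate of J(pi_phi)
```

Once trained, acting online is just two cheap forward evaluations of the policy per experiment, which is the point made above: the expensive search over decision rules happens offline.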