Case Study

Calculating Trajectory Utility with Importance Sampling

An agent's policy is being updated. The utility of a trajectory is calculated by re-weighting the advantage of each action. The formula for the utility U of a two-step trajectory is:

$$U = \left( \frac{\pi_{\theta}(a_1|s_1)}{\pi_{\theta_{\text{ref}}}(a_1|s_1)} \cdot A(s_1, a_1) \right) + \left( \frac{\pi_{\theta}(a_2|s_2)}{\pi_{\theta_{\text{ref}}}(a_2|s_2)} \cdot A(s_2, a_2) \right)$$

Based on the data in the case study below, calculate the total utility U for this trajectory and explain what the final value implies for the policy update.
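The arithmetic in the formula can be sketched in a few lines of Python. This is only an illustration: the function name `trajectory_utility` and every number in the example are hypothetical placeholders, not the case study's data, and should be replaced with the actual probabilities and advantages.

```python
def trajectory_utility(steps):
    """Sum the importance-weighted advantages over a trajectory.

    Each step is a tuple (pi_theta, pi_ref, advantage), where
    pi_theta = probability of the action under the updated policy,
    pi_ref   = probability of the same action under the reference policy,
    advantage = A(s, a) for that state-action pair.
    """
    total = 0.0
    for pi_theta, pi_ref, advantage in steps:
        if pi_ref <= 0.0:
            raise ValueError("reference probability must be positive")
        ratio = pi_theta / pi_ref      # importance-sampling ratio
        total += ratio * advantage     # re-weighted advantage for this step
    return total


# Hypothetical values, chosen only to illustrate the calculation.
example_steps = [
    (0.6, 0.5, 2.0),    # step 1: ratio 1.2, positive advantage
    (0.2, 0.4, -1.0),   # step 2: ratio 0.5, negative advantage
]
utility = trajectory_utility(example_steps)
print(f"U = {utility:.2f}")
```

With these made-up numbers the two terms are 1.2 · 2.0 = 2.4 and 0.5 · (−1.0) = −0.5, so U is about 1.9. Under this formula, a positive total suggests that the updated policy has shifted probability toward actions with positive advantage and away from actions with negative advantage, relative to the reference policy.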

Updated 2025-10-03

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science