Multiple Choice

In a policy optimization process, a penalty is used to measure the change between a current policy, $\pi_{\theta}$, and a reference policy, $\pi_{\theta_{\text{ref}}}$. The penalty is calculated for a specific sequence of actions and states (a trajectory, $\tau$) using the formula:

$$\text{Penalty} = \log \pi_{\theta}(\tau) - \log \pi_{\theta_{\text{ref}}}(\tau)$$

If the calculated penalty for a particular trajectory is a large positive value, what is the most accurate interpretation?
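To see why a large positive penalty means the current policy has drifted toward assigning the trajectory much higher probability than the reference policy, the formula can be sketched numerically. This is a toy illustration with made-up per-step probabilities, assuming $\log \pi(\tau)$ factorizes into a sum of per-step log-probabilities:

```python
import math

# Hypothetical per-step probabilities each policy assigns to the same
# actions along trajectory tau (illustrative numbers, not from the question).
current_probs = [0.9, 0.8, 0.7]    # pi_theta(a_t | s_t)
reference_probs = [0.3, 0.4, 0.5]  # pi_theta_ref(a_t | s_t)

# log pi(tau) is the sum of per-step log-probabilities.
log_p_current = sum(math.log(p) for p in current_probs)
log_p_reference = sum(math.log(p) for p in reference_probs)

# Penalty = log pi_theta(tau) - log pi_theta_ref(tau)
penalty = log_p_current - log_p_reference
print(penalty)  # positive here: pi_theta makes tau far more likely than the reference
```

Because the penalty is the log of the probability ratio $\pi_{\theta}(\tau) / \pi_{\theta_{\text{ref}}}(\tau)$, a large positive value means the current policy assigns the trajectory a much higher probability than the reference policy does.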


Updated 2025-10-02


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Computing Sciences

Foundations of Large Language Models Course

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science