Learn Before
Surrogate Objective at the Policy Reference Point
When the current policy parameters are identical to the reference policy parameters, a condition denoted by θ = θ_ref, the standard importance-sampled surrogate objective simplifies. The importance sampling ratio Pr_θ(τ) / Pr_{θ_ref}(τ) becomes one, so the surrogate objective's value equals the expected reward of the reference policy:

J(θ_ref) = E_{τ ~ π_{θ_ref}} [ R(τ) ]

In this specific context, the term 'surrogate objective' may refer to this simplified expression, which is equivalent to the true on-policy objective at this point.
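The simplification can be checked numerically. Below is a minimal sketch, assuming a toy discrete trajectory space with made-up probabilities and rewards (none of these numbers come from the source): when the current policy's trajectory probabilities match the reference policy's, every importance ratio is 1 and the surrogate objective coincides with E_{τ ~ π_{θ_ref}}[R(τ)].

```python
import numpy as np

# Toy setup (hypothetical values): four possible trajectories tau.
pr_ref = np.array([0.1, 0.2, 0.3, 0.4])   # Pr_{theta_ref}(tau), sums to 1
rewards = np.array([1.0, 2.0, 0.5, 3.0])  # R(tau) for each trajectory

# At the reference point theta = theta_ref, the current policy assigns
# the same trajectory probabilities as the reference policy.
pr_theta = pr_ref.copy()

ratio = pr_theta / pr_ref  # importance sampling ratio: all ones here

# Surrogate objective: expectation under pi_{theta_ref} of ratio * R(tau).
surrogate = np.sum(pr_ref * ratio * rewards)

# True on-policy objective of the reference policy: E[R(tau)].
on_policy = np.sum(pr_ref * rewards)

assert np.allclose(ratio, 1.0)
assert np.isclose(surrogate, on_policy)
print(surrogate, on_policy)  # both print 1.85
```

Away from the reference point (pr_theta ≠ pr_ref) the ratios deviate from 1 and the two quantities generally differ, which is exactly why the equivalence only holds at θ = θ_ref.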

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Equivalence of the Surrogate Objective and the On-Policy Objective
Surrogate Objective at the Policy Reference Point
Equivalence of Surrogate and On-Policy Gradients at the Reference Point
Training a Policy with Off-Distribution Data
A reinforcement learning agent is being updated. The current policy is denoted by π_θ, and a batch of trajectory data has been collected using a previous, fixed policy, π_{θ_ref}. To improve the current policy using this existing data, the following objective function is optimized: J(θ) = E_{τ ~ π_{θ_ref}} [ (Pr_θ(τ) / Pr_{θ_ref}(τ)) * R(τ) ]. Which statement best analyzes the role of this objective function in the training process?
Rationale for Using a Surrogate Objective
Separation of Sampling and Reward Computation in Policy Learning
Variance in Surrogate Objective Gradient Estimates
Clipped Surrogate Objective Function
Learn After
In a reinforcement learning process, a new policy defined by parameters θ is evaluated using an objective function that relies on data from a reference policy with parameters θ_ref. The objective function is:
J(θ) = E_{τ ~ π_{θ_ref}} [ (Pr_θ(τ) / Pr_{θ_ref}(τ)) * R(τ) ]
Where τ is a trajectory, Pr(τ) is the probability of that trajectory, R(τ) is its total reward, and E_{τ ~ π_{θ_ref}} denotes the expected value over trajectories from the reference policy.
What does this objective function J(θ) simplify to at the specific point where the new policy is identical to the reference policy (i.e., θ = θ_ref)?
Reasoning for Objective Simplification
In a reinforcement learning scenario, the performance of a new policy, defined by parameters θ, is often estimated using an objective function that relies on data collected from a reference policy, defined by parameters θ_ref. This objective function is given by: J(θ) = E_{τ ~ π_{θ_ref}} [ (Pr_θ(τ) / Pr_{θ_ref}(τ)) * R(τ) ], where τ represents a trajectory, Pr(τ) is the probability of that trajectory, and R(τ) is its total reward. Which of the following statements most accurately evaluates the relationship between this objective function, J(θ), and the true expected reward of the reference policy, E_{τ ~ π_{θ_ref}}[R(τ)]?