Training Objective as Maximization of the Performance Function
In reinforcement learning, the primary goal of training is to find the policy parameters, denoted by $\theta$, that maximize the objective, or performance function, $J(\theta)$. Because $J(\theta)$ measures the expected cumulative reward obtained by following the policy, maximizing it improves the policy toward the highest possible expected return. Formally, the optimal parameters $\theta^{*}$ are given by:

$$\theta^{*} = \arg\max_{\theta} J(\theta)$$
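As a rough illustration of this maximization, the sketch below treats the performance function as an expected return over a small set of trajectories and picks the best of a few candidate parameter settings. Everything in it is an assumption made for illustration: the candidate names (`theta_1`, `theta_2`, `theta_3`), the trajectory probabilities and rewards, and the helper `expected_return` do not come from the course material. Real training searches a continuous parameter space, usually with gradient-based updates, rather than enumerating candidates.

```python
def expected_return(trajectories):
    """Performance of one policy: sum of P(trajectory) * R(trajectory)."""
    return sum(prob * reward for prob, reward in trajectories)


# Hypothetical candidates: each maps to (probability, total reward) pairs
# for the trajectories that the corresponding policy can produce.
candidates = {
    "theta_1": [(0.5, 10.0), (0.5, 0.0)],
    "theta_2": [(0.8, 10.0), (0.2, -5.0)],
    "theta_3": [(0.25, 10.0), (0.75, 4.0)],
}

# Evaluate the performance function for every candidate.
performance = {name: expected_return(t) for name, t in candidates.items()}

# The arg max: the candidate with the highest expected return.
best = max(performance, key=performance.get)

for name, value in performance.items():
    print(f"{name}: J = {value:.2f}")
print("best:", best)
```

Only the expected return enters the comparison here; other considerations, such as how long training took, are not part of this objective.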
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Derivation of the Policy Gradient Objective Function
Off-Policy Objective Function with Importance Sampling
An agent is operating under a policy parameterized by $\theta$. This policy can result in one of two possible trajectories. Trajectory A has a total reward of 20 and a 70% probability of occurring. Trajectory B has a total reward of -10 and a 30% probability of occurring. Given that the performance of a policy is measured by the expected cumulative reward over all possible trajectories ($J(\theta)$), what is the value of the performance function for this policy?
Critique of the Expected Reward Objective
On-Policy Objective Function (Performance Measure)
Policy Performance Comparison
Learn After
Optimal Policy Parameters via Maximization Formula
An engineer is training a system using a reinforcement learning approach. The system's behavior is determined by a set of adjustable parameters. The training process aims to find the parameter values that maximize a specific 'performance function,' which represents the expected cumulative reward. The engineer runs two separate training procedures, Procedure X and Procedure Y, and observes the following final outcomes:
- Procedure X: The final set of parameters results in a performance function value of 150.
- Procedure Y: The final set of parameters results in a performance function value of 125. However, Procedure Y completed in half the time of Procedure X.
Which statement best evaluates the outcomes in relation to the primary training objective?
Evaluating Policy Effectiveness
Identifying Optimal Policy Parameters from Training Data
Basic Policy Gradient Approach