Learn Before
An agent is being trained to find the best route through a system. It is presented with two options:
- Route 1: Provides a consistent, small positive reward at every step, resulting in a total reward of +15 for the entire route.
- Route 2: Starts with a step that gives a negative reward (a penalty) of -5, but subsequent steps lead to very high rewards, resulting in a total reward of +50 for the entire route.
An agent that has been successfully trained according to the primary objective of its learning framework will learn to choose Route 2. Which of the following statements best explains why?
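To make the expected answer concrete, here is a minimal sketch (not part of the original card) of an agent comparing undiscounted cumulative rewards. The per-step reward values are hypothetical, chosen only so the totals match the +15 and +50 stated above.

```python
# Minimal sketch: an agent that maximizes cumulative reward prefers the
# route with the higher total, regardless of early penalties.
# Per-step values below are hypothetical; only the sums (+15, +50) are given.

routes = {
    "Route 1": [3, 3, 3, 3, 3],           # small positive reward each step -> +15
    "Route 2": [-5, 11, 11, 11, 11, 11],  # early -5 penalty, high later rewards -> +50
}

# Undiscounted return: the sum of rewards along the route.
returns = {name: sum(rewards) for name, rewards in routes.items()}
best = max(returns, key=returns.get)

print(returns)  # {'Route 1': 15, 'Route 2': 50}
print(best)     # Route 2: the cumulative objective favors it despite the -5 start
```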
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Objective Function as Expected Cumulative Reward (Performance Function)
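For reference, the objective named by this related card is conventionally written as the expected cumulative (optionally discounted) reward. The notation below is the standard textbook form and may differ from the linked card's exact statement:

```latex
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_{t}\right], \qquad 0 < \gamma \le 1,
```

where \(\pi\) is the agent's policy, \(r_t\) is the reward received at step \(t\), and \(\gamma\) is the discount factor.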
Analysis of a Suboptimal Agent Policy
An agent is learning to play a game in which the objective is to achieve the highest possible final score. At a critical decision point, the agent chooses an action that yields an immediate reward of 0, passing up an alternative that would have given an immediate reward of +10. This decision necessarily indicates that the agent's policy is flawed and not aligned with the primary goal of its learning framework.
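The claim above does not hold in general: an agent maximizing the final score can rationally pass up immediate reward when the 0-reward action leads to a higher cumulative total. The numbers in this sketch are hypothetical and purely illustrative.

```python
# Sketch of why forgoing immediate reward is not necessarily a flaw.
# Hypothetical two-step game; values are illustrative, not from the card.

# Option A: take +10 now, after which no further reward is available.
option_a = [10, 0]

# Option B: take 0 now, which opens a position worth +100 on the next step.
option_b = [0, 100]

print(sum(option_a))  # 10
print(sum(option_b))  # 100 -> the 0-reward action maximizes the final score
```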