Learn Before
Goal of Reinforcement Learning
The primary objective in reinforcement learning is to learn a policy that lets an agent maximize its cumulative reward, also known as the return, over an extended period of interaction with its environment.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Goal of Reinforcement Learning
Agent Performance Calculation
An agent interacts with an environment over a sequence of four time steps. The rewards it receives at each step are as follows: r₁ = +3, r₂ = -1, r₃ = +5, r₄ = -2. What is the total cumulative reward for this entire sequence?
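The arithmetic behind this kind of question can be sketched in a few lines of Python. This is a minimal illustration, assuming the cumulative reward is the plain, undiscounted sum of the per-step rewards; the function name `cumulative_reward` is a hypothetical helper chosen for this example.

```python
def cumulative_reward(rewards):
    """Sum the per-step rewards (assumption: no discounting is applied)."""
    return sum(rewards)


# Rewards r1..r4 taken from the question above.
rewards = [3, -1, 5, -2]
total = cumulative_reward(rewards)
print(total)  # 5
```

The same helper applies to a sequence of any length, so it can also be used to compare two sequences by their totals.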
Consider an agent that completes a five-step sequence of actions, receiving the following rewards at each step: [-5, +1, +1, +1, 0]. This sequence is preferable to another sequence that consists of a single step with a reward of -1.
Learn After
Objective Function as Expected Cumulative Reward (Performance Function)
An agent is being trained to find the best route through a system. It is presented with two options:
- Route 1: Provides a consistent, small positive reward at every step, resulting in a total reward of +15 for the entire route.
- Route 2: Starts with a step that gives a negative reward (a penalty) of -5, but subsequent steps lead to very high rewards, resulting in a total reward of +50 for the entire route.
An agent that has been successfully trained according to the primary objective of its learning framework will learn to choose Route 2. Which of the following statements best explains why?
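The contrast between the two routes can be sketched as follows. The route totals (+15 and +50) and the -5 first step of Route 2 come from the scenario; the first-step reward of +1 for Route 1 is an assumed value, since the scenario only says it is small and positive. The dictionary names are hypothetical.

```python
# Total reward for each complete route (values from the scenario).
route_returns = {"Route 1": 15, "Route 2": 50}

# Reward for the first step only. The +1 for Route 1 is an assumption;
# the scenario states only that it is small and positive.
first_step_reward = {"Route 1": 1, "Route 2": -5}

# A short-sighted choice looks only at the immediate reward.
best_by_first_step = max(first_step_reward, key=first_step_reward.get)

# A return-maximizing choice looks at the total over the whole route.
best_by_return = max(route_returns, key=route_returns.get)

print(best_by_first_step)  # Route 1
print(best_by_return)      # Route 2
```

Under these assumptions, ranking by immediate reward and ranking by return disagree, and only the second ranking matches the objective of maximizing cumulative reward.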
Analysis of a Suboptimal Agent Policy
An agent is learning to play a game where the objective is to get the highest possible final score. At a critical decision point, the agent chooses an action that yields an immediate reward of 0, passing up an alternative action that would have given an immediate reward of +10. This decision is necessarily an indication that the agent's policy is flawed and not aligned with the primary goal of its learning framework.