1Cademy - Analysis of a Suboptimal Agent Policy

Learn Before

Goal of Reinforcement Learning

Case Study

Analysis of a Suboptimal Agent Policy

Based on the fundamental goal of reinforcement learning, analyze the agent's final behavior. Why is this behavior considered suboptimal, and what does it suggest about the agent's learning process in relation to its objective?

Updated 2025-10-03

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science

Objective Function as Expected Cumulative Reward (Performance Function)

An agent is being trained to find the best route through a system. It is presented with two options:
- Route 1: Provides a consistent, small positive reward at every step, resulting in a total reward of +15 for the entire route.
- Route 2: Starts with a step that gives a negative reward (a penalty) of -5, but subsequent steps lead to very high rewards, resulting in a total reward of +50 for the entire route.
An agent that has been successfully trained according to the primary objective of its learning framework will learn to choose Route 2. Which of the following statements best explains why?
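
The comparison between the two routes can be made concrete with a small sketch. The per-step rewards below are hypothetical: the text gives only the totals (+15 and +50) and the initial -5 penalty on Route 2. The helper name `total_return` and the optional discount factor `gamma` are also assumptions added for illustration.

```python
# Hypothetical per-step rewards. Only the route totals (+15, +50) and the
# initial -5 penalty on Route 2 come from the text; the rest is assumed.
route_1 = [3, 3, 3, 3, 3]
route_2 = [-5, 15, 20, 20]


def total_return(rewards, gamma=1.0):
    """Cumulative reward of a route, discounted by gamma per step (1.0 = no discount)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


routes = {"Route 1": route_1, "Route 2": route_2}
returns = {name: total_return(rewards) for name, rewards in routes.items()}

# An agent maximizing cumulative reward compares whole-route returns,
# not the reward of the first step alone.
best_route = max(returns, key=returns.get)

# A myopic agent that looks only at the first step would pick differently.
myopic_route = max(routes, key=lambda name: routes[name][0])

print(returns, best_route, myopic_route)
```

Under these assumed numbers, the return-maximizing choice is Route 2 even though its first step is a penalty, while a first-step-only comparison favors Route 1.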

Analysis of a Suboptimal Agent Policy

An agent is learning to play a game where the objective is to get the highest possible final score. At a critical decision point, the agent chooses an action that yields an immediate reward of 0, passing up an alternative action that would have given an immediate reward of +10. This decision is necessarily an indication that the agent's policy is flawed and not aligned with the primary goal of its learning framework.
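
One way to examine this statement is to compare final scores rather than immediate rewards. The sketch below is illustrative only: the action names `wait` and `grab` and the follow-up rewards are invented, since the text specifies only the immediate rewards of 0 and +10.

```python
# Hypothetical continuations. Only the immediate rewards (0 and +10)
# come from the text; the later rewards are assumed for illustration.
outcomes = {
    "wait": [0, 25],  # immediate 0, then a large later reward (assumed)
    "grab": [10, 5],  # immediate +10, then a small later reward (assumed)
}

# Choice that maximizes the immediate reward only.
greedy_choice = max(outcomes, key=lambda action: outcomes[action][0])

# Choice that maximizes the final score (sum of all rewards).
score_choice = max(outcomes, key=lambda action: sum(outcomes[action]))

print(greedy_choice, score_choice)
```

With other assumed continuations (for example, if the later reward after `wait` were only 5), both criteria would select `grab`. The immediate reward alone therefore does not settle whether the policy is flawed; what matters is the final score each action leads to.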
