Case Study

Diagnosing Trajectory Errors in a Reinforcement Learning System

Case context: You are developing a reinforcement learning system where the goal is to find a good trajectory T. The system uses a learned reward function Score(T) = R(T) as its approximate scoring function, and a reinforcement learning algorithm to search for and execute a trajectory that maximizes this reward. During testing, the system outputs a trajectory that is highly suboptimal.

Question: Using the "approximate scoring function plus approximate maximization" pattern, how should you structure an analysis to determine whether the suboptimal trajectory reflects a failure of the reward function or a failure of the RL search algorithm?

Sample answer: Apply the Optimization Verification test. Take a reference trajectory T* that is known to be better than the system's output T_out (ideally the optimal one) and compare the two under the learned reward function. If R(T_out) > R(T*), the reward function ranks the worse trajectory higher, so the error lies in the scoring function: it does not capture what makes a trajectory good. If R(T*) > R(T_out), the reward function ranks this pair correctly, so it is not the cause of this error; the reinforcement learning algorithm failed to find a trajectory that the reward function would have scored higher, which indicates an optimization/search failure.
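The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original case study: the function name `optimization_verification`, the `reward` callable, and the string labels are all hypothetical, and the "inconclusive" branch for ties is an assumption, since the case study only covers the two strict inequalities.

```python
from typing import Callable, TypeVar

# A trajectory, in whatever representation the system uses (assumption).
Trajectory = TypeVar("Trajectory")


def optimization_verification(
    reward: Callable[[Trajectory], float],
    t_star: Trajectory,
    t_out: Trajectory,
) -> str:
    """Attribute a single error to the reward function or to the search.

    reward : the learned scoring function R (hypothetical callable)
    t_star : a reference trajectory known to be better than t_out
    t_out  : the trajectory the RL system actually produced
    """
    r_star = reward(t_star)
    r_out = reward(t_out)
    if r_out > r_star:
        # R prefers the worse trajectory: the scoring function is at fault.
        return "reward"
    if r_star > r_out:
        # R ranks the pair correctly, yet the search did not find t_star
        # (or anything scoring as high): the maximization is at fault.
        return "search"
    # Equal scores are not covered by the case study (assumption).
    return "inconclusive"


# Toy check: trajectories are lists of per-step values and R sums them.
print(optimization_verification(sum, [1.0, 2.0, 3.0], [1.0, 1.0]))  # search
```

The test only needs two evaluations of R, so it can be run on each observed failure without retraining anything; the one requirement is a reference trajectory T* that is independently known to be better than T_out.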

Key points:

  • Identify the reward function R(T) as the approximate scoring function.
  • Identify the RL algorithm as the approximate maximization algorithm.
  • Compare the score of the optimal trajectory T* against the output trajectory T_out.
  • Attribute the error to the scoring function if the suboptimal output scores higher (R(T_out) > R(T*)).
  • Attribute the error to the maximization algorithm if the optimal trajectory scores higher but was not selected (R(T*) > R(T_out)).

Rubric: The learner must state that they will compare the reward (score) assigned to the optimal trajectory with the reward assigned to the system-generated trajectory. They must correctly identify that R(T_out) > R(T*) means the scoring function is at fault, and that R(T*) > R(T_out) means the maximization/search algorithm is at fault.

Updated 2026-05-26

Tags

Data Science

Machine Learning

Deep Learning

Supervised Learning

Dive into Deep Learning @ D2L

Machine Learning Strategy
