Applying Optimization Verification in RL

Diagnosing Reward Function Issues

Interpreting _____ in Optimization Verification

Matching Optimization Verification Components

Executing the Optimization Verification Test

Question:
Explain the significance of the inequality R(Thuman) > R(Tout) in the context of the Optimization Verification test. What does it tell you about the reward function and the reinforcement learning algorithm?

Sample answer:
When the inequality R(Thuman) > R(Tout) holds, it means that the reward function correctly assigns a higher score to the superior human trajectory compared to the inferior trajectory generated by the algorithm. This signifies that the reward function is functioning as intended, accurately reflecting the desired tradeoff or behavior. Consequently, the fault lies with the reinforcement learning algorithm, which is failing to maximize the reward and is instead settling for an inferior trajectory. The next step is to improve the learning algorithm's optimization process.

Key points:
- Validates the reward function
- Identifies the learning algorithm as the problem
- Indicates the algorithm is failing to maximize the reward

Rubric:
A strong answer should clearly state that the inequality validates the reward function and identifies the learning algorithm as the source of the poor performance.

Analyzing the R(Thuman) > R(Tout) Inequality

Case context:
You are building an RL agent to land a simulated helicopter. You define a reward function to balance landing accuracy and ride smoothness. The agent learns to land, but its trajectory (Tout) is much bumpier than a human pilot's trajectory (Thuman). You apply the Optimization Verification test and find that R(Thuman) is less than R(Tout).

Question:
Based on this finding, what specific component of your system should you focus on improving, and why?

Sample answer:
You should focus on improving the reward function. Because R(Thuman) is less than R(Tout), the reward function is incorrectly assigning a higher score to the bumpier, inferior algorithm trajectory than to the smoother, superior human trajectory. This means the reward function fails to specify the ideal tradeoff between ride bumpiness and landing accuracy, so it must be redesigned.

Key points:
- Improve the reward function
- The function assigns higher reward to the inferior trajectory
- The tradeoff specification is flawed

Rubric:
The answer must identify the reward function as the component to improve and explain that it is incorrectly scoring the trajectories.

Helicopter Landing Reinforcement Learning

Question:
According to the Optimization Verification test, under what specific condition should you deduce that your reinforcement learning algorithm (and not the reward function) needs improvement?

Sample answer:
You should improve the reinforcement learning algorithm if the reward assigned to the superior human trajectory is strictly greater than the reward assigned to the algorithm's inferior trajectory.

Key points:
- Compare human and algorithm trajectories
- Human trajectory scores higher
- Indicates algorithm fails to optimize

Rubric:
The response must mention comparing the human and algorithm trajectories and checking if the human trajectory scores higher.

When to Improve the RL Algorithm

Interpreting R(Thuman) vs R(Tout)

Purpose of Optimization Verification

For a reinforcement learning system whose trajectory is worse than a human pilot trajectory, compare the reward assigned to the human trajectory with the reward assigned to the algorithm trajectory. If the human trajectory scores higher, improving the reinforcement learning algorithm is worthwhile; if it does not, improve the reward function.
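
The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the `Trajectory` fields, the `toy_reward` formula, its `bump_weight` parameter, and all numbers are hypothetical stand-ins for whatever reward function a real system defines.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """Hypothetical trajectory summary (assumption, not from the source)."""
    accelerations: List[float]  # per-step vertical accelerations
    landing_error: float        # distance from the landing target


def toy_reward(t: Trajectory, bump_weight: float = 1.0) -> float:
    """Illustrative R(T): penalize landing error and bumpiness."""
    bumpiness = sum(a * a for a in t.accelerations)
    return -t.landing_error - bump_weight * bumpiness


def diagnose_rl(reward: Callable[[Trajectory], float],
                t_human: Trajectory, t_out: Trajectory) -> str:
    """Apply the test: which component does the comparison point to?"""
    if reward(t_human) > reward(t_out):
        # The reward prefers the better trajectory; the learner failed to find it.
        return "learning algorithm"
    # The reward does not prefer the better trajectory.
    return "reward function"


# Made-up trajectories: the human lands smoothly, the agent lands bumpily.
t_human = Trajectory(accelerations=[0.1, -0.1, 0.05], landing_error=0.5)
t_out = Trajectory(accelerations=[1.0, -1.5, 2.0], landing_error=0.2)

# A reward that penalizes bumpiness ranks the human trajectory higher.
print(diagnose_rl(toy_reward, t_human, t_out))

# A reward that ignores bumpiness ranks the agent's trajectory higher.
print(diagnose_rl(lambda t: toy_reward(t, bump_weight=0.0), t_human, t_out))
```

With the bumpiness penalty in place the test points at the learning algorithm; with the penalty set to zero it points at the reward function, because the reward no longer encodes the smoothness tradeoff.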

The Optimization Verification test compares the score of a known correct output with the score of the system output. If the correct output scores higher, blame the optimization or search algorithm; otherwise, blame the scoring function computation.

Optimization Verification Test

Machine Learning Yearning, a free ebook from Andrew Ng, teaches you how to structure Machine Learning projects.
https://www.deeplearning.ai/machine-learning-yearning/

Note: The content of the book is aligned with the Coursera Deeplearning.ai specialization: https://www.deeplearning.ai/deep-learning-specialization/

Machine Learning Yearning (Deeplearning.ai)

In practice, apply the Optimization Verification test across errors in the dev set. Each error where the correct output scores higher is marked as an optimization-algorithm error; each error where it does not is counted as a scoring-function error.
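
One way to sketch this tally, assuming each dev set error is available as a tuple of input, correct output, and system output. The function name `tally_errors`, the clip identifiers, and the score table are hypothetical and exist only to make the example runnable.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, Tuple


def tally_errors(score: Callable[[Hashable, Hashable], float],
                 errors: Iterable[Tuple[Hashable, Hashable, Hashable]]) -> Counter:
    """Count dev set errors by the component the test attributes them to."""
    counts: Counter = Counter()
    for x, y_star, y_out in errors:
        if score(x, y_star) > score(x, y_out):
            counts["optimization algorithm"] += 1
        else:
            counts["scoring function"] += 1
    return counts


# Made-up scores for (input, output) pairs; a real system would call its model.
toy_scores = {
    ("clip1", "a"): 0.9, ("clip1", "b"): 0.4,
    ("clip2", "c"): 0.3, ("clip2", "d"): 0.7,
    ("clip3", "e"): 0.8, ("clip3", "f"): 0.6,
}
# Each entry: (input, correct output, system output).
dev_errors = [("clip1", "a", "b"), ("clip2", "c", "d"), ("clip3", "e", "f")]

counts = tally_errors(lambda x, y: toy_scores[(x, y)], dev_errors)
print(counts)
```

Whichever category dominates the counts suggests where effort is more likely to pay off.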

Applying Optimization Verification Across Dev Set Errors

A common AI design pattern is to first learn an approximate scoring function and then use an approximate maximization algorithm. Recognizing this pattern lets one use Optimization Verification to understand the source of errors.

Approximate Scoring Function Plus Approximate Maximization Pattern

For machine translation, the system computes a score for each candidate translation and uses heuristic search, because the set of possible English sentences is too large to enumerate. If the correct translation scores above the system translation, attribute the error to the approximate search; otherwise, attribute it to the score computation.
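
A toy sketch of how an approximate search can miss a higher-scoring translation. Everything here is assumed for illustration: the two-word vocabulary, the `BIGRAM` score table, and the greedy word-by-word search are stand-ins for a learned scoring model and a real heuristic decoder.

```python
from itertools import product
from typing import Dict, Tuple

# Hypothetical bigram scores standing in for a learned translation score.
BIGRAM: Dict[Tuple[str, str], float] = {
    ("<s>", "the"): 1.0, ("<s>", "a"): 1.5,
    ("the", "cat"): 3.0, ("the", "dog"): 0.5,
    ("a", "cat"): 1.0, ("a", "dog"): 1.2,
}
FIRST_WORDS = ["the", "a"]
SECOND_WORDS = ["cat", "dog"]


def score(sentence: Tuple[str, ...]) -> float:
    """Score a sentence as the sum of its bigram scores."""
    total, prev = 0.0, "<s>"
    for word in sentence:
        total += BIGRAM[(prev, word)]
        prev = word
    return total


def greedy_search() -> Tuple[str, str]:
    """Approximate search: commit to the best-looking word at each step."""
    first = max(FIRST_WORDS, key=lambda w: BIGRAM[("<s>", w)])
    second = max(SECOND_WORDS, key=lambda w: BIGRAM[(first, w)])
    return (first, second)


y_star = ("the", "cat")   # assumed correct translation
y_out = greedy_search()   # what the approximate search returns

# Optimization Verification: compare the two scores.
verdict = "search algorithm" if score(y_star) > score(y_out) else "score computation"
print(y_out, verdict)

# Exhaustive search is only feasible because this toy space has four sentences.
best = max(product(FIRST_WORDS, SECOND_WORDS), key=score)
print(best)
```

Greedy search commits to "a" because it looks best at the first step, and so never reaches the higher-scoring "the cat". The test attributes the error to the search, and the exhaustive check confirms that the scoring function does rank the correct translation first.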

Optimization Verification for Machine Translation

Optimization Verification for Reinforcement Learning Reward Functions

Optimization Verification does not require a truly optimal output. If a superior output is available, such as a human pilot trajectory that is better than the current learning algorithm output, the test can still indicate whether improving the optimization algorithm or the scoring function is more promising.

Using a Superior Human Output in Optimization Verification

Diagnosing Errors with Optimization Verification

To apply the Optimization Verification test for a given input $x$, you must know how to compute a _____ that indicates how good a response $y$ is to that input.

Optimization Verification Variables

Steps to Perform Optimization Verification

Question:
Suppose your machine learning system outputs $y_{out}$ instead of the correct response $y^*$. Describe how you would use the Optimization Verification test to diagnose the problem, and explain the two possible outcomes and what they mean for your system's components.

Sample answer:
To use the Optimization Verification test, I would compute the score for the correct response, $Score_x(y^*)$, and the score for the system's output, $Score_x(y_{out})$, and then compare the two. If $Score_x(y^*) > Score_x(y_{out})$, the scoring function correctly assigned a higher score to the right answer, but the search/optimization algorithm failed to find it, so I would blame the optimization algorithm. If $Score_x(y^*) \le Score_x(y_{out})$, the scoring function failed to rank the correct answer above the incorrect one, so a better search alone would not reliably fix the error, and I would blame the scoring function computation.

Key points:
- Compute $Score_x(y^*)$ and $Score_x(y_{out})$.
- Compare the two scores.
- If $Score_x(y^*) > Score_x(y_{out})$, blame the optimization or search algorithm.
- If $Score_x(y^*) \le Score_x(y_{out})$, blame the scoring function computation.

Rubric:
A good response will describe the process of computing and comparing the two scores and accurately interpret both possible outcomes of the inequality.
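
The reasoning behind the two outcomes can be written out as a short worked argument, assuming the system chooses its output by approximately maximizing the score:

$$
y_{out} \approx \arg\max_{y} Score_x(y)
$$

Case 1. If the correct response scores strictly higher, the output cannot be a maximizer of the score, so the search fell short:

$$
Score_x(y^*) > Score_x(y_{out}) \;\Longrightarrow\; Score_x(y_{out}) < \max_{y} Score_x(y)
$$

Case 2. If the correct response does not score strictly higher, then $y^*$ is not the unique maximizer, so even an exact maximization would not be guaranteed to return it:

$$
Score_x(y^*) \le Score_x(y_{out}) \;\Longrightarrow\; y^* \text{ is not the unique maximizer of } Score_x
$$

In Case 1 the optimization algorithm is the component to improve; in Case 2 it is the scoring function.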

Interpreting the Test Outcomes

Case context:
You are developing a speech recognition system. For a specific audio clip, the correct transcription is "I love machine learning" ($S^*$), but your system outputs "I love robots" ($S_{out}$). Your system uses a scoring function $Score_A$ and an approximate search algorithm to find the transcription with the highest score.

Question:
How would you use the Optimization Verification test to diagnose why the system output "I love robots"? What specific measurements would you take, and how would you interpret the results to decide whether to fix the search algorithm or the scoring function?

Sample answer:
I would compute $Score_A("I love machine learning")$ and $Score_A("I love robots")$. Then, I would check if $Score_A("I love machine learning") > Score_A("I love robots")$. If this inequality is true, it means the scoring function is working properly (it prefers the correct answer), so I should fix the approximate search algorithm because it failed to find the transcription with the highest score. If the inequality is false, it means the scoring function erroneously assigned a higher (or equal) score to the incorrect transcription, so I should fix the scoring function computation.

Key points:
- Compute $Score_A(S^*)$ and $Score_A(S_{out})$.
- Check whether $Score_A(S^*) > Score_A(S_{out})$.
- Blame the search algorithm if the inequality holds.
- Blame the scoring function if the inequality does not hold.

Rubric:
A strong answer will explicitly mention computing the scores for the two specific transcriptions provided and correctly explain how to map the inequality to the component at fault.
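
The measurement described in this case can be sketched as follows. The score values are made up for illustration, and `optimization_verification` and `Verdict` are hypothetical names; a real $Score_A$ would come from the trained speech model.

```python
from typing import Callable, NamedTuple


class Verdict(NamedTuple):
    score_correct: float
    score_output: float
    blame: str


def optimization_verification(score_a: Callable[[str], float],
                              s_star: str, s_out: str) -> Verdict:
    """Compare Score_A(S*) with Score_A(S_out) and name the component to fix."""
    sc, so = score_a(s_star), score_a(s_out)
    # Strict inequality: a tie is attributed to the scoring function.
    blame = "search algorithm" if sc > so else "scoring function"
    return Verdict(sc, so, blame)


s_star = "I love machine learning"
s_out = "I love robots"

# Hypothetical scores in which the correct transcription scores higher.
toy_scores = {s_star: -4.2, s_out: -6.8}
result = optimization_verification(toy_scores.__getitem__, s_star, s_out)
print(result.blame)

# Hypothetical tie: the scoring function fails to prefer the correct transcription.
tied_scores = {s_star: -5.0, s_out: -5.0}
tied = optimization_verification(tied_scores.__getitem__, s_star, s_out)
print(tied.blame)
```

Returning both scores alongside the verdict makes it easy to log how large the gap was for each error, not only which side it fell on.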

Speech Recognition Debugging

Question:
Under what specific mathematical condition during the Optimization Verification test do we blame the optimization algorithm for a system's mistake?

Sample answer:
We blame the optimization algorithm when the score of the correct output is strictly greater than the score of the system's output: $Score_x(y^*) > Score_x(y_{out})$.

Key points:
- The score of the correct output ($y^*$ or $S^*$) must be compared to the score of the actual output ($y_{out}$ or $S_{out}$).
- The condition is $Score_x(y^*) > Score_x(y_{out})$.

Rubric:
The answer must identify the condition where the correct output's score is higher than the actual output's score.
