Learn Before
Return
Our final goal is to choose actions over time so that we could maximize the expected value of the return. The definition of return is as below: The return is the total discounted reward from time-step t. , where γ is the discounted factor. When γ close to 0 leads to “myopic” evaluation; γ close to 1 leads to “far-sighted” evaluation.
0
1
Contributors are:
Who are from:
Tags
Data Science
Foundations of Large Language Models Course
Computing Sciences
Learn After
An autonomous agent is being trained to navigate a grid. From its current position, it can choose one of two paths. Path A leads to an immediate reward of +10. Path B involves several steps with no immediate reward, but ultimately leads to a reward of +100. Two separate agents are trained for this task: Agent 1 uses a discount factor of 0.1, and Agent 2 uses a discount factor of 0.9. Based on these settings, which outcome is most likely?
Calculating Discounted Return
Selecting an Appropriate Discount Factor