For each agent described in the scenarios below, determine whether a 'myopic' evaluation (using a discount factor close to 0) or a 'far-sighted' evaluation (using a discount factor close to 1) would be more suitable for its training. Justify your reasoning for each choice.


Our goal is to choose actions over time so as to maximize the expected value of the return, which is defined as follows:
The return $G_t$ is the total discounted reward from time-step $t$.
$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, where $\gamma \in [0, 1]$ is the discount factor.
A value of $\gamma$ close to 0 leads to “myopic” evaluation, which weights immediate rewards most heavily; a value of $\gamma$ close to 1 leads to “far-sighted” evaluation, which gives delayed rewards nearly as much weight as immediate ones.
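The definition of the return can be illustrated with a minimal Python sketch. The function name `discounted_return` and the example reward list are hypothetical, chosen only for illustration; the sketch assumes a finite episode, so rewards after the last listed one are treated as zero.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute G_t = sum_k gamma**k * rewards[k], where rewards[0] plays the role of R_{t+1}.

    Hypothetical helper for illustration; assumes a finite list of rewards.
    """
    g = 0.0
    # Work backwards through the episode, applying G = R + gamma * G_next at each step.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]  # illustrative rewards, not taken from any scenario
print(discounted_return(rewards, 0.0))  # myopic: only the first reward counts
print(discounted_return(rewards, 0.5))  # later rewards are progressively down-weighted
print(discounted_return(rewards, 1.0))  # far-sighted: all rewards count equally
```

With the discount factor at 0 the result equals the first reward alone, and with the discount factor at 1 it equals the plain sum of all rewards, which matches the "myopic" and "far-sighted" extremes described above.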

Return

An autonomous agent is being trained to navigate a grid. From its current position, it can choose one of two paths. Path A leads to an immediate reward of +10. Path B involves several steps with no immediate reward, but ultimately leads to a reward of +100. Two separate agents are trained for this task: Agent 1 uses a discount factor of 0.1, and Agent 2 uses a discount factor of 0.9. Based on these settings, which outcome is most likely?
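One way to reason about this scenario is to compare the discounted value of each path from the starting position. The sketch below is only an illustration under stated assumptions: the scenario says "several steps" without giving a number, so the `delay_steps` parameter (defaulting to 3 zero-reward steps before the +100) and the function name `path_values` are hypothetical.

```python
def path_values(gamma: float, delay_steps: int = 3) -> tuple[float, float]:
    """Return (value of Path A, value of Path B) as seen from the starting position.

    Assumption: Path A pays +10 on the first step, while Path B pays nothing for
    `delay_steps` steps and then pays +100. The step count is hypothetical.
    """
    value_a = 10.0                            # immediate reward, not discounted
    value_b = (gamma ** delay_steps) * 100.0  # delayed reward, discounted once per waiting step
    return value_a, value_b


for gamma in (0.1, 0.9):
    a, b = path_values(gamma)
    choice = "A" if a > b else "B"
    print(f"gamma={gamma}: G(A)={a:.2f}, G(B)={b:.2f} -> prefers Path {choice}")
```

Under this assumed delay, the agent with a discount factor of 0.1 values Path B far below Path A, while the agent with a discount factor of 0.9 values Path B well above it. A much longer delay would shrink Path B's value for both agents, so the conclusion depends on the assumed step count.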

An autonomous agent receives the following sequence of rewards starting from the next time-step (t+1): +5, +2, +10. The episode ends after the third reward. If the agent uses a discount factor (γ) of 0.5, what is the total discounted return (G_t) from the current time-step (t)?
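As a worked calculation under the definition of the return, taking the three rewards as $R_{t+1} = 5$, $R_{t+2} = 2$, $R_{t+3} = 10$ and assuming no further reward once the episode ends:

$$G_t = 5 + 0.5 \cdot 2 + 0.5^2 \cdot 10 = 5 + 1 + 2.5 = 8.5$$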
