An autonomous agent receives the following sequence of rewards starting from the next time-step (t+1): +5, +2, +10. The episode ends after the third reward. If the agent uses a discount factor (γ) of 0.5, what is the total discounted return (G_t) from the current time-step (t)?


Our ultimate goal is to choose actions over time so as to maximize the expected value of the return. The return is defined as follows:
The return $G_t$ is the total discounted reward from time-step t.
$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, where $\gamma \in [0, 1]$ is the discount factor.
A value of γ close to 0 leads to “myopic” evaluation; a value of γ close to 1 leads to “far-sighted” evaluation.
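
As a minimal sketch of the definition above, the return for a finite episode can be computed by treating every reward after termination as zero, so the infinite sum stops at the last reward. The function name `discounted_return` and the reward sequence are hypothetical, chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma**k * R_{t+k+1} for a finite episode.

    rewards[0] is R_{t+1}, rewards[1] is R_{t+2}, and so on.
    Rewards after the end of the episode are assumed to be zero.
    """
    g = 0.0
    # Accumulate from the last reward backwards: G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode: a small reward now, a larger reward three steps later.
rewards = [1.0, 0.0, 0.0, 8.0]

print(discounted_return(rewards, gamma=0.0))  # myopic: only R_{t+1} counts -> 1.0
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5**3 * 8 -> 2.0
print(discounted_return(rewards, gamma=1.0))  # far-sighted: undiscounted sum -> 9.0
```

With γ = 0 the later reward of 8 contributes nothing to the return, while with γ = 1 it counts in full, which is the sense in which small γ is "myopic" and large γ is "far-sighted".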

Return

An autonomous agent is being trained to navigate a grid. From its current position, it can choose one of two paths. Path A leads to an immediate reward of +10. Path B involves several steps with no immediate reward, but ultimately leads to a reward of +100. Two separate agents are trained for this task: Agent 1 uses a discount factor of 0.1, and Agent 2 uses a discount factor of 0.9. Based on these settings, which outcome is most likely?

Calculating Discounted Return

For each agent described in the scenarios below, determine whether a 'myopic' evaluation (using a discount factor close to 0) or a 'far-sighted' evaluation (using a discount factor close to 1) would be more suitable for its training. Justify your reasoning for each choice.
