
The cumulative reward, or return, of a state-action sequence (a trajectory) is the sum of the individual rewards received along that sequence. For a trajectory of $$T$$ time steps, $$\tau = \{(s_1,a_1),...,(s_T,a_T)\}$$, with $$r_t$$ denoting the reward received at step $$t$$, the cumulative reward $$R(\tau)$$ is the sum of the rewards from the first step to the last:

$$R(\tau) = \sum_{t=1}^{T} r_t$$
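The sum above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Step` record, the `cumulative_reward` function, and the sample trajectory values are all hypothetical names and data chosen for the example.

```python
from typing import NamedTuple, Sequence


class Step(NamedTuple):
    """One time step of a trajectory (hypothetical record layout)."""
    state: str
    action: str
    reward: float


def cumulative_reward(trajectory: Sequence[Step]) -> float:
    """Return R(tau): the undiscounted sum of r_t over all steps."""
    return sum(step.reward for step in trajectory)


# Made-up three-step trajectory, used only to exercise the function.
trajectory = [
    Step("s1", "a1", 1.0),
    Step("s2", "a2", -0.5),
    Step("s3", "a3", 2.0),
]

total = cumulative_reward(trajectory)
print(total)  # 2.5
```

States and actions are carried along only to mirror the definition of $$\tau$$; the return itself depends on the rewards alone.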

Cumulative Reward of a Trajectory

The primary objective in reinforcement learning is to learn a policy that lets an agent maximize the cumulative reward (the return) it collects over an extended period of interaction with its environment.

Goal of Reinforcement Learning

An agent's interaction with an environment is recorded as a sequence of steps, where a numerical reward is assigned after each action. Given the following interaction log for a single episode, calculate the total reward accumulated by the agent from the beginning to the end of the sequence.

Agent Performance Calculation

An agent interacts with an environment over a sequence of four time steps. The rewards it receives at each step are as follows: r₁ = +3, r₂ = -1, r₃ = +5, r₄ = -2. What is the total cumulative reward for this entire sequence?
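As a worked check, applying the definition of cumulative reward (a plain, undiscounted sum of the step rewards) to these four values gives:

$$R(\tau) = r_1 + r_2 + r_3 + r_4 = 3 + (-1) + 5 + (-2) = 5$$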

Consider an agent that completes a five-step sequence of actions, receiving the following rewards at each step: [-5, +1, +1, +1, 0]. This sequence is preferable to another sequence that consists of a single step with a reward of -1.
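One way to evaluate a claim like this is to compute each sequence's return and compare them. The sketch below assumes the only criterion is the undiscounted cumulative reward; `episode_return` and the variable names are hypothetical, introduced just for this example.

```python
def episode_return(rewards):
    """Undiscounted return: the plain sum of the per-step rewards."""
    return sum(rewards)


five_step = [-5, 1, 1, 1, 0]   # rewards of the five-step sequence
single_step = [-1]             # rewards of the one-step sequence

return_five = episode_return(five_step)      # -5 + 1 + 1 + 1 + 0 = -2
return_single = episode_return(single_step)  # -1

# Under this criterion, the sequence with the larger return is preferred.
preferred = "five-step" if return_five > return_single else "single-step"
print(return_five, return_single, preferred)  # -2 -1 single-step
```

Under that assumption the five-step sequence has the lower return (-2 versus -1), so the comparison does not favor it.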
