Concept

REINFORCE Algorithm (Monte-Carlo Policy Gradient)

This algorithm uses Monte-Carlo rollouts to generate episodes according to the policy π_θ. For each episode, it then iterates over the steps of the episode and computes the total return G_t. It uses G_t together with the score function ∇_θ log π(s,a; θ) to learn the parameter θ.

Input: a differentiable policy π(s,a; θ). Algorithm parameter: step size α > 0. Initialize the policy parameter θ ∈ ℝ^d (e.g. to 0). Loop as long as you want:

  • Generate an episode s₀, a₀, r₁, ..., s_{T−1}, a_{T−1}, r_T, following π
  • Loop for each step of the episode t = 0, 1, ..., T−1:
    • G ← return from step t (G_t)
    • θ ← θ + α γ^t G ∇_θ log π(s_t, a_t; θ)
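The loop above can be sketched in plain Python with a tabular softmax policy. The environment here (a hypothetical 3-state corridor where action 1 moves right toward a rewarding goal state) and all constants are made up for illustration; for a softmax policy, the score function ∇_θ log π(s,a;θ) reduces to 1[a′=a] − π(s,a′) for each preference θ[s][a′].

```python
import math
import random

random.seed(0)

# Hypothetical toy corridor: states 0..2, action 1 moves right, action 0 moves left.
# Reaching the last state yields reward 1 and ends the episode.
N_STATES, N_ACTIONS = 3, 2
GAMMA, ALPHA = 0.9, 0.1

# Policy parameters theta: one preference per (state, action) pair.
theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def policy(s):
    """Softmax over the action preferences theta[s]."""
    z = [math.exp(p) for p in theta[s]]
    total = sum(z)
    return [p / total for p in z]

def sample_action(s):
    """Sample an action from pi(s, .; theta)."""
    r, acc = random.random(), 0.0
    for a, p in enumerate(policy(s)):
        acc += p
        if r < acc:
            return a
    return N_ACTIONS - 1

def generate_episode(max_steps=20):
    """Roll out one episode following pi, as a list of (s, a, r) steps."""
    s, traj = 0, []
    for _ in range(max_steps):
        a = sample_action(s)
        s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        traj.append((s, a, r))
        if s_next == N_STATES - 1:
            break
        s = s_next
    return traj

for _ in range(500):
    episode = generate_episode()
    # Compute the return G_t for every step t by scanning backwards.
    G, returns = 0.0, []
    for (_, _, r) in reversed(episode):
        G = r + GAMMA * G
        returns.append(G)
    returns.reverse()
    # REINFORCE update: theta <- theta + alpha * gamma^t * G_t * grad log pi.
    for t, (s, a, _) in enumerate(episode):
        probs = policy(s)
        for a2 in range(N_ACTIONS):
            grad = (1.0 if a2 == a else 0.0) - probs[a2]
            theta[s][a2] += ALPHA * (GAMMA ** t) * returns[t] * grad

print([round(p, 2) for p in policy(0)])
```

After training, the policy at states 0 and 1 should put most of its probability on action 1 (moving right toward the goal).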

Updated 2020-10-17

Tags

Data Science