Concept

REINFORCE Algorithm (Monte-Carlo Policy Gradient)

This algorithm uses Monte-Carlo rollouts to generate episodes according to the policy π_θ. For each episode, it then iterates over the steps of the episode and computes the total return G_t. It uses G_t together with the score function ∇_θ log π(s,a; θ) to learn the parameter θ.

Input: a differentiable policy π(s,a; θ). Algorithm parameter: step size α > 0. Initialize the policy parameter θ ∈ ℝ^d (e.g. to 0). Loop as long as you want:

  • Generate an episode s₀, a₀, r₁, ..., s_{T−1}, a_{T−1}, r_T, following π
  • Loop for each step of the episode t = 0, 1, ..., T−1:
    • G ← return from step t (G_t)
    • θ ← θ + α γ^t G ∇_θ log π(s_t, a_t; θ)
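The loop above can be sketched in plain Python with a tabular softmax policy. The environment here (a hypothetical 3-state corridor where action 1 moves right toward a rewarding goal state) and all constants are made up for illustration; for a softmax policy, the score function ∇_θ log π(s,a;θ) reduces to 1[a′=a] − π(s,a′) for each preference θ[s][a′].

```python
import math
import random

random.seed(0)

# Hypothetical toy corridor: states 0..2, action 1 moves right, action 0 moves left.
# Reaching the last state yields reward 1 and ends the episode.
N_STATES, N_ACTIONS = 3, 2
GAMMA, ALPHA = 0.9, 0.1

# Policy parameters theta: one preference per (state, action) pair.
theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def policy(s):
    """Softmax over the action preferences theta[s]."""
    z = [math.exp(p) for p in theta[s]]
    total = sum(z)
    return [p / total for p in z]

def sample_action(s):
    """Sample an action from pi(s, .; theta)."""
    r, acc = random.random(), 0.0
    for a, p in enumerate(policy(s)):
        acc += p
        if r < acc:
            return a
    return N_ACTIONS - 1

def generate_episode(max_steps=20):
    """Roll out one episode following pi, as a list of (s, a, r) steps."""
    s, traj = 0, []
    for _ in range(max_steps):
        a = sample_action(s)
        s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        traj.append((s, a, r))
        if s_next == N_STATES - 1:
            break
        s = s_next
    return traj

for _ in range(500):
    episode = generate_episode()
    # Compute the return G_t for every step t by scanning backwards.
    G, returns = 0.0, []
    for (_, _, r) in reversed(episode):
        G = r + GAMMA * G
        returns.append(G)
    returns.reverse()
    # REINFORCE update: theta <- theta + alpha * gamma^t * G_t * grad log pi.
    for t, (s, a, _) in enumerate(episode):
        probs = policy(s)
        for a2 in range(N_ACTIONS):
            grad = (1.0 if a2 == a else 0.0) - probs[a2]
            theta[s][a2] += ALPHA * (GAMMA ** t) * returns[t] * grad

print([round(p, 2) for p in policy(0)])
```

After training, the policy at states 0 and 1 should put most of its probability on action 1 (moving right toward the goal).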

Updated 2020-10-17

Tags

Data Science