Concept
REINFORCE Algorithm (Monte-Carlo Policy Gradient)
This algorithm uses Monte-Carlo sampling to generate episodes by following the policy π_θ. For each episode, it iterates over the time steps and computes the total return G(t) from each step. It then uses G(t) and the score function ∇θ log π(s, a; θ) to update the parameter θ.
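Input: a differentiable policy π(s, a; θ)
Algorithm parameter: step size α > 0
Initialize the policy parameter θ (e.g. to 0)
Loop as long as you want (once per episode):
- Generate an episode S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T, following π_θ
- Loop for each step of the episode, t = 0, 1, ..., T-1:
  - G(t) ← return from step t
  - θ ← θ + α γ^t G(t) ∇θ log π(S_t, A_t; θ), where γ is the discount factor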
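The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes a tabular softmax policy, where theta[s][a] is a preference for action a in state s, and a discount factor gamma. The function names (softmax, returns_from_rewards, reinforce_update, train_toy) and the one-state toy task are hypothetical and chosen only for this example.

```python
import math
import random


def softmax(prefs):
    """Turn a list of action preferences into action probabilities."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def returns_from_rewards(rewards, gamma):
    """rewards[t] holds R_{t+1}; the result holds G(t) for every step t."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G(t) = R_{t+1} + gamma * G(t+1)
        returns[t] = g
    return returns


def reinforce_update(theta, episode, alpha, gamma):
    """Apply the REINFORCE update for one episode, changing theta in place.

    Assumption of this sketch: theta[s][a] is the preference of a tabular
    softmax policy. episode is a list of (state, action, reward) tuples.
    """
    rewards = [r for (_, _, r) in episode]
    returns = returns_from_rewards(rewards, gamma)
    for t, (s, a, _) in enumerate(episode):
        probs = softmax(theta[s])  # policy at S_t before this step's update
        for b in range(len(theta[s])):
            # Score function of a softmax policy: 1[b == a] - pi(s, b; theta)
            score = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += alpha * (gamma ** t) * returns[t] * score


def train_toy(num_episodes=2000, alpha=0.1, gamma=1.0, seed=0):
    """Hypothetical one-state task: action 1 pays reward 1, action 0 pays 0."""
    rng = random.Random(seed)
    theta = [[0.0, 0.0]]  # one state, two actions, initialised to 0
    for _ in range(num_episodes):
        probs = softmax(theta[0])
        action = rng.choices([0, 1], weights=probs)[0]  # sample from the policy
        reward = 1.0 if action == 1 else 0.0
        reinforce_update(theta, [(0, action, reward)], alpha, gamma)
    return softmax(theta[0])
```

Calling train_toy() returns the learned action probabilities for the single state; since only action 1 is ever rewarded, the probability of action 1 should move toward 1 as training proceeds. A tabular softmax keeps the score function easy to write by hand; with a neural-network policy the same update would typically rely on automatic differentiation instead.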
Updated 2020-10-17
Tags
Data Science