Concept

Policy Gradient Methods for Deep Reinforcement Learning

In policy gradient methods, we directly learn the policy function π\pi, which outputs a probability distribution over actions. The term π(s,a;θ)[0,1]\pi(s,a;\theta) \in [0,1] represents the probability of taking action aa given state ss with parameters θ\theta. Neural networks can be used to find the policy function, taking the state as input and producing the probability distribution of actions.

The general process is:

  • The agent takes in a state and computes the probability of each action.
  • It samples an action based on this probability distribution and observes the next state and reward.
  • This cycle repeats until the end of the episode (game) and the total reward is evaluated.
  • The parameters θ\theta in the network are updated using backpropagation and gradient ascent based on the rewards.

Through this process, the network allows the agent to play and explore, gradually increasing the probabilities of actions that lead to positive returns.

Image 0

0

1

Updated 2026-06-13

Tags

Data Science