In policy gradient methods, we directly learn the policy function $$\pi$$, which outputs a probability distribution over actions. The term $$\pi(s,a;\theta) \in [0,1]$$ represents the probability of taking action $$a$$ in state $$s$$ under parameters $$\theta$$. A neural network can represent the policy function, taking the state as input and outputting a probability distribution over actions.

The general process is:
- The agent takes in a state and computes the probability of each action.
- It samples an action based on this probability distribution and observes the next state and reward.
- This cycle repeats until the episode (game) ends, at which point the total reward is evaluated.
- The parameters $$\theta$$ in the network are updated using backpropagation and gradient ascent based on the rewards.

Through this process, the network allows the agent to play and explore, gradually increasing the probabilities of actions that lead to positive returns.
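The loop above can be sketched as a REINFORCE-style update. This is a minimal illustration under stated assumptions, not a reference implementation: the corridor environment (`step`), its rewards, and the hyperparameters are hypothetical, and a tabular softmax policy stands in for the neural network so that the gradient of $$\log \pi$$ can be written by hand instead of obtained through backpropagation.

```python
import numpy as np

# Hypothetical toy task: walk right along a 4-state corridor to reach the goal.
N_STATES, N_ACTIONS, GOAL, MAX_STEPS = 4, 2, 3, 20

def policy(theta, s):
    """pi(s, .; theta): softmax over the action preferences of state s."""
    prefs = theta[s] - theta[s].max()      # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    """Assumed environment: action 1 moves right, action 0 moves left."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    done = s_next == GOAL
    reward = 1.0 if done else -0.1         # small cost per step, bonus at the goal
    return s_next, reward, done

def run_episode(theta, rng):
    """Sample actions from the policy until the episode ends."""
    s, trajectory = 0, []
    for _ in range(MAX_STEPS):
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        s_next, r, done = step(s, a)
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory

def reinforce_update(theta, trajectory, lr=0.1, gamma=0.99):
    """Gradient ascent on sum_t G_t * grad log pi(s_t, a_t; theta)."""
    grad = np.zeros_like(theta)
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + gamma * G                  # discounted return from this step on
        grad_log = -policy(theta, s)
        grad_log[a] += 1.0                 # d log pi / d theta[s] = onehot(a) - pi(s, .)
        grad[s] += G * grad_log
    return theta + lr * grad

def train(episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        theta = reinforce_update(theta, run_episode(theta, rng))
    return theta
```

Calling `train()` and then `policy(theta, 0)` shows how the probability of moving right in the start state changes as actions followed by higher returns are reinforced.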

Policy Gradient Methods for Deep Reinforcement Learning

In value-based methods, we calculate the Q-values for all possible actions in the action space, but in real-world problems the number of actions can be very large or even infinite. For example, the robot walking problem has a continuous, high-dimensional action space.

In policy-based methods, we directly learn the policy function π without worrying about a value function, which means we can choose actions without calculating the Q(s, a) values.
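One common way to choose a continuous action without enumerating Q-values is to have the policy output the parameters of a distribution and sample from it. The sketch below assumes a Gaussian policy with a linear mean; the function name, the weight matrix `W`, the `log_std` vector, and the 3-dimensional state with 2 action dimensions are all hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_policy_sample(state, W, log_std, rng):
    """Sample a ~ N(W @ state, diag(exp(log_std))**2); no Q(s, a) is computed."""
    mean = W @ state
    std = np.exp(log_std)                    # parameterise std via its log to keep it positive
    return mean + std * rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
state = np.array([0.5, -1.0, 0.2])           # hypothetical 3-D observation
W = rng.normal(scale=0.1, size=(2, 3))       # hypothetical policy weights (2 action dimensions)
log_std = np.full(2, -1.0)                   # assumed fixed exploration noise
action = gaussian_policy_sample(state, W, log_std, rng)
```

Because the action is drawn directly from the distribution, the cost of acting does not grow with the number of possible actions, which is what makes this form usable in continuous action spaces.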

Value-based Methods vs. Policy-based Methods

A helpful website that explains how policy-based deep reinforcement learning works:
https://medium.com/deep-math-machine-learning-ai/ch-13-deep-reinforcement-learning-deep-q-learning-and-policy-gradients-towards-agi-a2a0b611617e

Reference for Policy-Based Method in Deep Reinforcement Learning

Policy-based methods learn the policy function directly, without calculating a value function for each action. These methods use a learning signal derived from sampling instantiations of policy parameters, and the set of policies is developed towards policies that achieve better returns.

Policy gradient methods are an example of policy-based algorithms.

University of Michigan - Ann Arbor

There are different types of deep reinforcement learning methods.

If we use either value functions or policies to act on the environment, the method is called Model-free reinforcement learning:
   * Value-based methods
   * Policy-based methods

If we make use of models of the environment, it is referred to as Model-based reinforcement learning.

Deep Reinforcement Learning Methods

An Introduction to Deep Reinforcement Learning
https://arxiv.org/pdf/1811.12560.pdf

Reference for Deep Reinforcement Learning

Policy-Based Methods for Deep Reinforcement Learning

The value-based class of algorithms aims to build a value function, which subsequently lets us define a policy. The most frequently used algorithms include Q-learning and deep Q-learning.
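As a small illustration of building a value function and then deriving a policy from it, here is a sketch of the standard tabular Q-learning update followed by a greedy policy. The table size, the step size `alpha`, the discount `gamma`, and the single transition used in the example are assumptions chosen for the demonstration.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Move Q[s, a] towards the target r + gamma * max_a' Q[s_next, a']."""
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def greedy_policy(Q):
    """The policy defined by the value function: pick the best-valued action per state."""
    return Q.argmax(axis=1)

Q = np.zeros((3, 2))                         # hypothetical problem: 3 states, 2 actions
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, done=False)
```

After this single update, `greedy_policy(Q)` selects action 1 in state 0, since that is the only entry with a non-zero estimate.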

Value-based Methods for Deep Reinforcement Learning

When a model of the environment (the estimated transition function and the estimated reward function) is available, the model can act as a proxy for the actual environment.
 - In many games, the rules of the game are the model.
 - In other cases, it can be the laws of physics. Sometimes we know how to model the environment and can build a simulator for it.

We can define this model with rules or equations, or we can use a Gaussian process, a Gaussian mixture model (GMM), or a deep network. To fit these models, we run a controller to collect sample trajectories and train the models with supervised learning.
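The collect-then-fit procedure can be sketched as follows. To keep the example short, ordinary least squares on a linear model is swapped in for the Gaussian process, GMM, or deep network mentioned above, and a random controller gathers the data. The environment `true_step` and all function names are hypothetical.

```python
import numpy as np

def true_step(s, a):
    """Stand-in for the real environment (assumed linear dynamics)."""
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([0.0, 0.1])
    return A @ s + B * a

def collect_transitions(n, rng):
    """Run a random controller and record (state, action, next_state) samples."""
    states = rng.uniform(-1, 1, size=(n, 2))
    actions = rng.uniform(-1, 1, size=(n, 1))
    next_states = np.array([true_step(s, a[0]) for s, a in zip(states, actions)])
    return states, actions, next_states

def fit_linear_model(states, actions, next_states):
    """Supervised learning step: regress next_state on (state, action)."""
    X = np.hstack([states, actions])                          # shape (n, 3)
    theta, *_ = np.linalg.lstsq(X, next_states, rcond=None)   # shape (3, 2)
    return theta

def predict(theta, s, a):
    """Learned transition function used as a proxy for the environment."""
    return np.concatenate([s, [a]]) @ theta

rng = np.random.default_rng(0)
theta = fit_linear_model(*collect_transitions(200, rng))
```

Here the data are noise-free and the true dynamics are linear, so the fitted model reproduces `true_step` closely; with real data the learned model would carry some error.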




Model-Based Methods for Deep Reinforcement Learning

The respective strengths of the model-free versus model-based approaches depend on different factors.

 - First, the best-suited approach depends on whether the agent has access to a model of the environment. If it does not, the model must be learned, and a learned model usually has inaccuracies that should be taken into account.
 - Second, a model-based approach requires working in conjunction with a planning algorithm (or controller), which is often computationally demanding. The time constraints for computing the policy π(s) need to be taken into account.
 - Third, for some tasks the structure of the policy (or value function) is the easiest to learn, while for others the model of the environment can be learned more efficiently because of the particular structure of the task. Which approach performs better therefore depends on the structure of the model, the policy, and the value function.
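To make the second point concrete, the sketch below pairs a model with a very simple planner, random shooting: sample many action sequences, simulate each with the model, and execute the first action of the best one. The planner choice, the 1-D model `model_step`, and the parameter values are assumptions for illustration only.

```python
import numpy as np

def model_step(s, a):
    """Hypothetical model: a 1-D point moved by the action; reward is -|position|."""
    s_next = s + a
    return s_next, -abs(s_next)

def plan_random_shooting(s0, horizon=3, n_candidates=500, seed=0):
    """Return the first action of the sampled sequence with the highest simulated return."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    best_return, best_action = -np.inf, 0.0
    for seq in candidates:
        s, total = s0, 0.0
        for a in seq:                        # roll the sequence out in the model
            s, r = model_step(s, a)
            total += r
        if total > best_return:
            best_return, best_action = total, seq[0]
    return float(best_action)
```

Each decision costs `n_candidates * horizon` model evaluations, which is the kind of computational demand that has to fit within the time available for computing π(s).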
