In value-based methods, we calculate the Q-values for every action in the action space, but in real-world problems the number of actions can be very large or even infinite. For example, the robot walking problem has a continuous, high-dimensional action space.

In policy-based methods, we learn the policy function π directly, without relying on a value function, which means we can choose actions without calculating the Q(s, a) values.

University of Michigan - Ann Arbor

Policy-based methods learn the policy function directly, without calculating a value function for each action. These methods use a learning signal derived from sampling instantiations of policy parameters, and the set of policies is developed towards policies that achieve better returns.

Policy gradient methods are an example of policy-based algorithms.

Policy-Based Methods for Deep Reinforcement Learning

The value-based class of algorithms aims to build a value function, which subsequently lets us define a policy. The most frequently used algorithms include Q-learning and deep Q-learning.

Value-based Methods for Deep Reinforcement Learning

A helpful website that explains how policy-based deep reinforcement learning works:
https://medium.com/deep-math-machine-learning-ai/ch-13-deep-reinforcement-learning-deep-q-learning-and-policy-gradients-towards-agi-a2a0b611617e

Reference for Policy-Based Method in Deep Reinforcement Learning

In policy gradient methods, we directly learn the policy function $$\pi$$, which outputs a probability distribution over actions. The term $$\pi(s,a;\theta) \in [0,1]$$ represents the probability of taking action $$a$$ given state $$s$$ with parameters $$\theta$$. Neural networks can be used to find the policy function, taking the state as input and producing the probability distribution of actions.
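
As a rough illustration of a network that maps a state to action probabilities, here is a minimal NumPy sketch. The layer sizes, the `tanh` hidden layer, the random weights, and the names `policy_probs`, `W1`, `b1`, `W2`, `b2` are all assumptions made for this example; the weights play the role of $$\theta$$ and the output plays the role of $$\pi(s,a;\theta)$$.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def policy_probs(state, W1, b1, W2, b2):
    # One hidden layer (an assumption), then a softmax over action logits.
    hidden = np.tanh(state @ W1 + b1)
    logits = hidden @ W2 + b2
    return softmax(logits)

# Hypothetical sizes: 4 state features, 8 hidden units, 3 discrete actions.
rng = np.random.default_rng(0)
state_dim, hidden_dim, n_actions = 4, 8, 3
W1 = rng.normal(scale=0.1, size=(state_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.1, size=(hidden_dim, n_actions))
b2 = np.zeros(n_actions)

state = np.array([0.1, -0.2, 0.0, 0.5])
probs = policy_probs(state, W1, b1, W2, b2)  # one probability per action
action = rng.choice(n_actions, p=probs)      # sample an action from the policy
```

Sampling from `probs`, rather than always taking the most likely action, is what lets the agent explore.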

The general process is:
- The agent takes in a state and computes the probability of each action.
- It samples an action based on this probability distribution and observes the next state and reward.
- This cycle repeats until the episode (game) ends, and then the total reward is evaluated.
- The parameters $$\theta$$ in the network are updated using backpropagation and gradient ascent based on the rewards.

Through this process, the network allows the agent to play and explore, gradually increasing the probabilities of actions that lead to positive returns.
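
The process above can be sketched with a REINFORCE-style update, one common policy gradient method. This is a minimal sketch and not a definitive implementation: the toy environment (2 states, 2 actions, reward 1 when the action index equals the state index), the tabular logits `theta`, and the hyperparameters are assumptions chosen so the example stays small. A neural network would replace `theta` and use backpropagation to compute the same kind of gradient.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def run_episode(theta, rng, horizon=10):
    # Hypothetical toy environment: reward 1 when action == state, else 0.
    # The next state is drawn uniformly at random.
    states, actions, rewards = [], [], []
    s = 0
    for _ in range(horizon):
        probs = softmax(theta[s])            # probability of each action
        a = rng.choice(len(probs), p=probs)  # sample an action
        r = 1.0 if a == s else 0.0           # observe the reward
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = int(rng.integers(2))             # observe the next state
    return states, actions, rewards

def reinforce(episodes=2000, lr=0.01, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((2, 2))  # one logit per (state, action) pair
    for _ in range(episodes):
        states, actions, rewards = run_episode(theta, rng)
        grad = np.zeros_like(theta)
        G = 0.0
        # Walk backwards so G is the discounted return from each step onward.
        for s, a, r in zip(reversed(states), reversed(actions), reversed(rewards)):
            G = r + gamma * G
            grad_log = -softmax(theta[s])  # d log pi(a|s) / d logits
            grad_log[a] += 1.0             # equals one_hot(a) - probs
            grad[s] += G * grad_log
        theta += lr * grad                 # gradient ascent on expected return
    return theta

theta = reinforce()
learned = np.array([softmax(theta[s]) for s in range(2)])
```

After training, each row of `learned` should put most of its probability on the rewarded action for that state. No baseline is subtracted from `G` here, which keeps the sketch short but makes the updates noisier than in practical implementations.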

Policy Gradient Methods for Deep Reinforcement Learning

Value-based Methods vs. Policy-based Methods

One value function frequently used in value-based methods is the action-value function Q(s,a), which represents the total value of taking action a in state s. It is the expected sum of the future rewards r, each discounted by a factor $$\gamma$$ per time step.
$$Q^{*}(s, a)=\max _{\pi} \mathbb{E}\left[r_{t}+\gamma r_{t+1}+\gamma^{2} r_{t+2}+\ldots \mid s_{t}=s, a_{t}=a, \pi\right]$$
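
The quantity inside the expectation is a discounted return. The sketch below computes it for a single sampled reward sequence; the function name `discounted_return` and the sample rewards are assumptions for illustration. $$Q^{*}(s,a)$$ itself is the expectation of this quantity under the best policy, not one sample of it.

```python
def discounted_return(rewards, gamma):
    # Computes r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    # by accumulating from the last reward backwards.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three rewards of 1 with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.9)
```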

The basic steps of deep Q-learning algorithms:
1. Train a convolutional neural network to extract the essential features that help the agent make a decision.
2. Calculate the Q-value of each possible action.
3. Perform backpropagation to update the network so that its Q-value estimates become more accurate.
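
To make step 3 more concrete, here is a hedged sketch of the error signal that backpropagation would minimize. It assumes the standard Q-learning target $$r + \gamma \max_{a'} Q(s', a')$$, which the steps above do not spell out, and it uses plain NumPy arrays in place of network outputs. The names `td_targets` and `td_loss` and the batch values are hypothetical.

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, gamma=0.99):
    # Target for each transition: r + gamma * max_a' Q(s', a').
    # Terminal transitions (dones == 1) keep only the reward.
    max_next = next_q_values.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next

def td_loss(q_values, actions, targets):
    # Mean squared error between Q(s, a) for the taken actions and the targets.
    chosen = q_values[np.arange(len(actions)), actions]
    return np.mean((chosen - targets) ** 2)

# Hypothetical batch of two transitions with two actions each.
q_values = np.array([[1.0, 2.0], [0.5, 0.0]])       # Q(s, .) from the network
actions = np.array([1, 0])                          # actions actually taken
rewards = np.array([1.0, 0.0])
next_q_values = np.array([[0.0, 3.0], [1.0, 2.0]])  # Q(s', .)
dones = np.array([0.0, 1.0])                        # second transition is terminal

targets = td_targets(rewards, next_q_values, dones, gamma=0.9)
loss = td_loss(q_values, actions, targets)
```

In a deep Q-learning setup, `q_values` and `next_q_values` would come from the network, and the gradient of `loss` with respect to the network weights would drive the update.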

Deep Q-learning

Value-based methods aim to find value functions. The advantage of learning the action-value function is that we can select actions without a model of the Markov Decision Process.
For example, in Q-learning, the optimal policy is given by
$$
\pi^{\star}(s)=\underset{a \in \mathcal{A}}{\operatorname{argmax}} \hat{Q}^{\star}(s, a)
$$
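
For a small discrete problem, this argmax can be read directly off a table of estimated Q-values. The following is a minimal sketch in which the table values and the name `greedy_policy` are made up for illustration; rows are states and columns are actions.

```python
import numpy as np

def greedy_policy(q_table):
    # For each state (row), pick the action (column) with the largest Q-value.
    return np.argmax(q_table, axis=1)

# Hypothetical Q estimates for 2 states and 3 actions.
q_table = np.array([
    [0.1, 0.5, 0.2],
    [1.0, 0.3, 0.9],
])
policy = greedy_policy(q_table)  # action index chosen in each state
```

Here the policy picks action 1 in state 0 and action 0 in state 1. No transition model is needed, only the Q estimates.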
