Value-based methods aim to learn value functions. The advantage of learning an action-value function is that we can then select actions without a model of the Markov Decision Process.
E.g., in Q-learning, the optimal policy is given by
$$
\pi^{\star}(s)=\underset{a \in \mathcal{A}}{\operatorname{argmax}} \hat{Q}^{\star}(s, a)
$$
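To make the argmax concrete, here is a minimal sketch of reading a greedy policy off a learned Q-table. The table `Q_hat`, its state and action names, and the helper `greedy_policy` are hypothetical and chosen only for illustration; note that no transition model is consulted.

```python
# Hypothetical learned Q-table: Q_hat[state][action] -> estimated action value.
Q_hat = {
    "s0": {"left": 0.2, "right": 0.7},
    "s1": {"left": 1.5, "right": 0.3},
}


def greedy_policy(q_table, state):
    """Return the action with the highest estimated Q-value in `state`."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)


# The policy is just a lookup plus an argmax over actions for each state.
policy = {state: greedy_policy(Q_hat, state) for state in Q_hat}
```

Here `policy` maps `"s0"` to `"right"` and `"s1"` to `"left"`, the actions with the largest entries in each row.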

University of Michigan - Ann Arbor

The value-based class of algorithms aims to build a value function, which subsequently lets us define a policy. The most frequently used algorithms include Q-learning and deep Q-learning.

Value-based Methods for Deep Reinforcement Learning

http://icaps18.icaps-conference.org/fileadmin/alg/conferences/icaps18/summerschool/lectures/Lecture5-rl-intro.pdf

Slides from CMU: Introduction to Deep Reinforcement Learning

In value-based methods, we calculate the Q-values for all possible actions in the action space, but in real-world problems the number of actions can be very large or even infinite. For example, the robot walking problem has a continuous, high-dimensional action space.

In policy-based methods, we directly learn the policy function π without relying on a value function, which means we can choose actions without calculating the Q(s,a) values.

Value-based Methods vs. Policy-based Methods

One value function frequently used in value-based methods is the action-value function Q(s,a), which represents the total value of taking action a in state s. The optimal action-value function Q*(s,a) is the maximum, over all policies π, of the expected sum of future rewards r, each discounted by the factor gamma once per time step.
$$Q^{*}(s, a)=\max _{\pi} \mathbb{E}\left[r_{t}+\gamma r_{t+1}+\gamma^{2} r_{t+2}+\ldots \mid s_{t}=s, a_{t}=a, \pi\right]$$
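The quantity inside the expectation is a discounted sum of rewards. As a small sketch, the function below computes that sum for a single sampled reward sequence; the name `discounted_return` and the example numbers are assumptions for illustration. Q*(s,a) would be the expectation of such returns under the best policy, which this snippet does not attempt to estimate.

```python
def discounted_return(rewards, gamma):
    """Compute r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for one trajectory."""
    g = 0.0
    # Accumulate backwards so each earlier step applies one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical rewards observed after taking action a in state s.
example = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

For these rewards the sum is 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5.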

The basic steps of a deep Q-learning algorithm:
1. Train a convolutional neural network to extract the essential features of the state that help the agent make a decision.
2. Calculate the Q-value of each possible action.
3. Perform back-propagation on the error between the predicted and target Q-values to make the Q-value estimates more accurate.
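The steps above can be sketched in a heavily simplified form. This is not a deep network: a linear function with one weight vector per action stands in for the convolutional network, the state features are assumed to be already extracted (step 1), and the gradient is written out by hand instead of using back-propagation through layers. The function names, learning rate, and discount factor are illustrative assumptions.

```python
def q_values(weights, features):
    """Step 2: Q(s, a) for every action, as a dot product of weights and features."""
    return [sum(w * x for w, x in zip(w_a, features)) for w_a in weights]


def q_learning_step(weights, features, action, reward, next_features,
                    gamma=0.9, lr=0.1, done=False):
    """Step 3: one gradient step that shrinks the error for the action taken."""
    # Target: observed reward plus the discounted best Q-value in the next state.
    target = reward
    if not done:
        target += gamma * max(q_values(weights, next_features))
    td_error = target - q_values(weights, features)[action]
    # For a linear model, the gradient of Q(s, a) w.r.t. its weights is the features.
    weights[action] = [w + lr * td_error * x
                       for w, x in zip(weights[action], features)]
    return td_error


# Two actions, two state features, all weights starting at zero (hypothetical setup).
weights = [[0.0, 0.0], [0.0, 0.0]]
first_error = q_learning_step(weights, [1.0, 0.0], action=1, reward=1.0,
                              next_features=[0.0, 1.0])
second_error = q_learning_step(weights, [1.0, 0.0], action=1, reward=1.0,
                               next_features=[0.0, 1.0])
```

Repeating the same transition reduces the error from one step to the next, which is the effect step 3 aims for. A full deep Q-learning implementation typically adds pieces not shown here, such as a replay buffer and a separate target network.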
