In model-based reinforcement learning, the model may be known or learned. In the latter case, we run a base policy, such as a random policy or a hand-designed one, and observe the resulting trajectories.

 1. run base policy $\pi_0(s_t, a_t)$ to collect $D = \{ (s,a,s')_i \}$
 2. learn dynamics model $f(s,a)$ to minimize $\sum_i ||f(s_i,a_i) - s_i' ||^2$ 
 3. backpropagate through $f(s,a)$ into the policy to optimize $\pi_{\theta} (s_t,a_t)$
 4. run $\pi_{\theta} (s_t,a_t)$ and add the resulting data $\{ (s,a,s')_j \}$ to $D$
 5. repeat from step 2

In step 2 above, we use supervised learning to train a model that minimizes the least-squares error on the sampled transitions.
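As a minimal sketch of steps 1 and 2, the example below collects transitions with a random base policy and fits a linear dynamics model by least squares. The environment (`A_true`, `B_true`, `env_step`), the noise level, and the choice of a linear model are all assumptions made for illustration; in practice $f(s,a)$ is often a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" environment, unknown to the learner: s' = A s + B a + noise.
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])

def env_step(s, a):
    return A_true @ s + B_true @ a + 0.01 * rng.standard_normal(2)

# Step 1: run a random base policy to collect D = {(s, a, s')_i}.
states, actions, next_states = [], [], []
s = np.zeros(2)
for _ in range(500):
    a = rng.uniform(-1.0, 1.0, size=1)
    s_next = env_step(s, a)
    states.append(s)
    actions.append(a)
    next_states.append(s_next)
    s = s_next

X = np.hstack([np.array(states), np.array(actions)])   # inputs [s, a], shape (500, 3)
Y = np.array(next_states)                              # targets s', shape (500, 2)

# Step 2: fit f(s, a) = [s, a] @ W by minimizing sum_i ||f(s_i, a_i) - s'_i||^2.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def f(s, a):
    """Learned dynamics model: predicts the next state."""
    return np.concatenate([s, a]) @ W

mse = np.mean(np.sum((X @ W - Y) ** 2, axis=1))
```

Because the assumed environment is itself linear, `W.T` should come out close to `[A_true, B_true]`; with nonlinear dynamics the same loss would be minimized with a more expressive model.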

In step 3, we use the model to predict the next state given an action, use the policy to choose the next action, and use the state and action to compute the cost. Finally, we backpropagate the cost through the model to train the policy.
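Step 3 can be sketched by hand for a very small case. The example below assumes a scalar model $f(s,a) = \alpha s + \beta a$ that has already been learned, a linear policy $a = \theta s$, and a quadratic cost $s^2 + R a^2$; these choices and all constants are hypothetical. It rolls the policy through the model, then walks the rollout backwards to accumulate the gradient of the total cost with respect to $\theta$, which is what an automatic differentiation library would do for a neural network model and policy.

```python
# Hypothetical learned scalar model f(s, a) = ALPHA * s + BETA * a.
ALPHA, BETA = 1.0, 0.5
R = 0.1          # action penalty in the cost c(s, a) = s^2 + R * a^2
HORIZON = 5

def cost_and_grad(theta, s0=1.0):
    """Roll the policy a = theta * s through the model, then backpropagate."""
    states, actions = [], []
    s, total = s0, 0.0
    for _ in range(HORIZON):
        a = theta * s                  # policy chooses the action
        total += s * s + R * a * a     # cost of this state and action
        states.append(s)
        actions.append(a)
        s = ALPHA * s + BETA * a       # model predicts the next state

    # Backward pass: d_s_next holds dJ/ds for the state one step later.
    d_theta, d_s_next = 0.0, 0.0
    for s, a in zip(reversed(states), reversed(actions)):
        d_a = 2.0 * R * a + BETA * d_s_next          # cost term + path through the model
        d_theta += d_a * s                           # a = theta * s
        d_s_next = 2.0 * s + ALPHA * d_s_next + theta * d_a
    return total, d_theta

theta = 0.0
initial_cost, _ = cost_and_grad(theta)
for _ in range(300):
    _, grad = cost_and_grad(theta)
    theta -= 0.02 * grad               # plain gradient descent on the policy parameter
final_cost, _ = cost_and_grad(theta)
```

The gradient flows through the model at every step, so errors in the learned model also distort the policy update; this is one reason the data collected by the new policy is fed back into model training.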

We continue to sample and refit the model as we move along the path.

Learn the Model in Model-Based Methods for Deep RL

When a model of the environment (the estimated transition function and the estimated reward function) is available, it can act as a proxy for the actual environment.
 - In many games, the rules of the game are the model.
 - In other cases, the model can be the laws of physics. Sometimes we know how to model them and can build simulators.

We can define this model with rules or equations. Alternatively, we can use a Gaussian process, a Gaussian mixture model (GMM), or a deep network. To fit these models, we run a controller to collect sample trajectories and train the model with supervised learning.




University of Michigan - Ann Arbor

There are different types of deep reinforcement learning methods.

If we use either value functions or policies to act on the environment, without a model of the environment, the method is called model-free reinforcement learning:
   * Value-based methods
   * Policy-based methods

If we make use of a model of the environment, the method is called model-based reinforcement learning.

Deep Reinforcement Learning Methods

An Introduction to Deep Reinforcement Learning
https://arxiv.org/pdf/1811.12560.pdf

Reference for Deep Reinforcement Learning

Policy-based methods learn the policy function directly, without computing a value function for each action. These methods use a learning signal derived from sampling instantiations of the policy parameters, and the set of policies is moved toward policies that achieve better returns.

Policy gradient methods are an example of policy-based algorithms.
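As an illustration, here is a minimal policy gradient (REINFORCE-style) sketch on a hypothetical two-armed bandit. The reward means, noise level, learning rates, and the running-average baseline are assumptions chosen for the example, not part of the text above. The policy is a softmax over two logits, and each sampled reward nudges the parameters along the gradient of the log-probability of the action taken.

```python
import numpy as np

rng = np.random.default_rng(0)
MEAN_REWARD = np.array([0.2, 0.8])   # hypothetical bandit: arm 1 pays more on average

def softmax(z):
    z = z - z.max()                  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

theta = np.zeros(2)                  # policy parameters: one logit per action
baseline = 0.0                       # running average of rewards, reduces variance
for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                        # sample an action from the policy
    r = MEAN_REWARD[a] + 0.1 * rng.standard_normal()  # noisy reward
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                             # grad of log pi(a) = onehot(a) - probs
    theta += 0.1 * (r - baseline) * grad_log_pi       # gradient ascent on expected return
    baseline += 0.1 * (r - baseline)

final_probs = softmax(theta)
```

No value function per action is stored here; the policy parameters are updated directly from sampled returns.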

Policy-Based Methods for Deep Reinforcement Learning

The value-based class of algorithms aims to build a value function, which subsequently lets us define a policy. The most frequently used algorithms include Q-learning and deep Q-learning.
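A minimal tabular Q-learning sketch shows how a value function leads to a policy. The five-state chain environment (`step`), the purely random behavior policy, and all constants are assumptions made for illustration. The table `Q` is learned first, and the policy is then read off by taking the highest-valued action in each state.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, GOAL, GAMMA, ALPHA = 5, 4, 0.9, 0.5

def step(s, a):
    """Hypothetical chain: action 0 moves left, 1 moves right; reward 1 at the goal."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done

Q = np.zeros((N_STATES, 2))
for episode in range(500):
    s = 0
    for t in range(100):
        a = int(rng.integers(2))                   # random behavior policy (off-policy)
        s_next, r, done = step(s, a)
        # Q-learning target: reward plus discounted value of the best next action.
        target = r if done else r + GAMMA * Q[s_next].max()
        Q[s, a] += ALPHA * (target - Q[s, a])
        s = s_next
        if done:
            break

# The policy is defined from the value function: act greedily in each non-goal state.
greedy_policy = Q[:GOAL].argmax(axis=1)
```

Deep Q-learning follows the same update but replaces the table with a neural network that approximates `Q`.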

Value-based Methods for Deep Reinforcement Learning

Model-Based Methods for Deep Reinforcement Learning

The respective strengths of the model-free versus model-based approaches depend on different factors.

 - First, the best-suited approach depends on whether the agent has access to a model of the environment. If it does not, the model must be learned, and a learned model usually has inaccuracies that should be taken into account.
 - Second, a model-based approach requires working in conjunction with a planning algorithm (or controller), which is often computationally demanding. The time constraints for computing the policy π(s) need to be taken into account.
 - Third, for some tasks, the structure of the policy (or value function) is the easiest one to learn, but for other tasks, the model of the environment may be learned more efficiently due to the particular structure of the task. Thus, which one performs better depends on the structure of the model, policy, and value function.
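The computational cost mentioned in the second point can be made concrete with a simple planner. The sketch below uses random shooting, one of several possible planning methods, with a hypothetical scalar model and a quadratic cost; the sample count and horizon are arbitrary. Each decision requires `n_samples * horizon` model evaluations, and the planner is rerun at every time step.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(s, a):
    """Hypothetical known (or learned) model: scalar state, action in [-1, 1]."""
    return s + 0.5 * a

def plan(s, n_samples=500, horizon=5):
    """Random shooting: return the first action of the cheapest sampled sequence."""
    actions = rng.uniform(-1.0, 1.0, size=(n_samples, horizon))
    states = np.full(n_samples, s, dtype=float)
    costs = np.zeros(n_samples)
    for t in range(horizon):              # n_samples * horizon model calls per decision
        states = model(states, actions[:, t])
        costs += states ** 2              # cost: squared distance from the origin
    return actions[costs.argmin(), 0]

s = 2.0
trajectory = [s]
for _ in range(20):
    s = model(s, plan(s))                 # replan from the current state at every step
    trajectory.append(s)
```

A model-free policy would replace the call to `plan` with a single function evaluation, which is the trade-off in computation time that the second point refers to.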
