The balance between on-policy and off-policy training is controlled by the weights assigned to each target in the loss function. From their experiments, the authors draw several conclusions. First, using the off-policy value target provides a clear benefit. Second, in an environment with intermediate rewards, the off-policy value target might not provide any new information; instead, it can lead the model to find and quickly overfit to a sub-optimal strategy. As the corresponding weights decay, the model eventually finds the optimal strategy, which results in faster convergence.
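
The weighting and decay described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the linear decay schedule, and the squared-error loss are all assumptions chosen for illustration, and the paper's actual loss terms and schedule may differ.

```python
def decayed_weight(initial_weight, step, decay_steps):
    """Hypothetical schedule: decay a loss weight linearly to 0 over decay_steps."""
    if decay_steps <= 0:
        return 0.0
    remaining = max(0.0, 1.0 - step / decay_steps)
    return initial_weight * remaining


def combined_value_loss(value_pred, on_policy_target, off_policy_target,
                        on_weight, off_weight):
    """Weighted sum of squared errors against the two value targets.

    Squared error is an assumption made for simplicity; only the idea of
    weighting an on-policy term and an off-policy term comes from the text.
    """
    on_loss = (value_pred - on_policy_target) ** 2
    off_loss = (value_pred - off_policy_target) ** 2
    return on_weight * on_loss + off_weight * off_loss


if __name__ == "__main__":
    # The off-policy weight shrinks as training progresses, so the
    # on-policy target dominates the loss in later steps.
    for step in (0, 50, 100):
        w_off = decayed_weight(1.0, step, decay_steps=100)
        loss = combined_value_loss(0.5, 1.0, 0.0, on_weight=1.0, off_weight=w_off)
        print(step, w_off, loss)
```

Once the off-policy weight reaches zero, the loss in this sketch reduces to the on-policy term alone, which mirrors the claim that the model can move away from an early sub-optimal strategy as the weights decay.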

University of Michigan - Ann Arbor

The paper reports experiments under three different conditions with different policies. In environments with sparse rewards, runs that use the off-policy value target with decay converge faster initially and achieve higher rewards than MuZero. In an environment with intermediate rewards, these runs converged faster initially than MuZero but stagnated at lower values. The combined algorithm converges faster than MuZero and no longer needs to assume that the environment is reversible.

Experimental Evaluation of Combining On- and Off-Policy Strategies

Borges, Alexandre and Arlindo L. Oliveira. “Combining Off and On-Policy Training in Model-Based Reinforcement Learning.” ArXiv abs/2102.12194 (2021): n. pag.
https://arxiv.org/pdf/2102.12194.pdf

Combining Off and On-Policy Training in Model-Based Reinforcement Learning

Comparing Environments

In model-based reinforcement learning, using an off-policy value target ($$\gamma$$) generally gives faster initial convergence across various environments. However, experimental comparisons show that relying solely on this off-policy target is often insufficient; it must be combined with an on-policy value target to maintain stable learning and achieve optimal long-term performance.

Off-Policy Value Target $$\gamma$$

In model-based reinforcement learning, incorporating an off-policy policy target ($$\delta$$) into algorithms like MuZero primarily provides benefits in specific environments such as Cartpole. While runs using this policy target demonstrate fast initial convergence, they typically stagnate at values lower than the standard MuZero baseline, ultimately impairing long-term convergence. The effectiveness of the off-policy policy target is highly dependent on the number of simulations, but it may offer utility in the very early stages of training.
