Learn Before
Concept

Experimental Evaluation of Combining On- and Off-Policy Strategies

This paper reports experiments under three different conditions with different policies. In environments with sparse rewards, runs that use the off-policy value target with decay converge faster initially and achieve higher rewards than MuZero. In an environment with intermediate rewards, these runs also converge faster initially than MuZero but stagnate at lower final values. The combined algorithm converges faster than MuZero and no longer needs to assume that the environment is reversible.
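The "off-policy value target with decay" can be pictured as a convex blend of an off-policy value estimate and an on-policy return, where the off-policy weight shrinks over training. The sketch below is a minimal illustration under that assumption; the function and parameter names (`decayed_value_target`, `decay_rate`) are hypothetical, not from the paper.

```python
import math

# Hypothetical sketch: blend an off-policy (bootstrapped) value estimate
# with the on-policy return, with the off-policy weight decaying
# exponentially as training progresses. Names are illustrative.
def decayed_value_target(off_policy_value: float,
                         on_policy_return: float,
                         step: int,
                         decay_rate: float = 1e-4) -> float:
    w = math.exp(-decay_rate * step)  # off-policy weight, shrinks with step
    return w * off_policy_value + (1.0 - w) * on_policy_return
```

Early in training the target leans on the off-policy estimate (helping initial convergence, as in the sparse-reward results), while later it relies mostly on on-policy returns.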


Updated 2021-08-19

Tags

Data Science