First, pick the best action at each step according to the visit counts in the search tree. Then unroll the environment by applying these actions to it, collecting the rewards needed for training.
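The two steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Node` class, the toy `CountingEnv`, and the `simulated_trajectory` function are hypothetical names introduced here, and the environment is assumed to expose a `step(action)` method returning a reward and a done flag.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical search-tree node: a visit count plus children keyed by action."""
    visit_count: int = 0
    children: dict = field(default_factory=dict)  # action -> Node


class CountingEnv:
    """Toy stand-in environment (assumption): reward equals the action, ends after 3 steps."""

    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        return float(action), self.t >= 3


def simulated_trajectory(root, env):
    """Follow the most-visited child at each level and apply that action to env."""
    actions, rewards = [], []
    node = root
    while node.children:
        # Greedy choice: the child with the highest visit count.
        action, node = max(node.children.items(), key=lambda kv: kv[1].visit_count)
        reward, done = env.step(action)
        actions.append(action)
        rewards.append(reward)
        if done:
            break
    return actions, rewards


# Small hand-built tree: action 0 is visited most at the root, then action 1.
root = Node(8, {
    0: Node(5, {0: Node(1), 1: Node(2)}),
    1: Node(3),
})
actions, rewards = simulated_trajectory(root, CountingEnv())
```

Because the actions are applied to the environment itself, a sketch like this presumably needs a copy of the environment (or one that can be reset to the searched state) so that the real episode is not disturbed.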

How to create Simulated Trajectories?

The authors merged the reward loss and the value loss into a single component, because the two are intertwined: the value is computed from the rewards. They also set fixed (hard-coded) weights and combine the loss for the real game trajectory with the loss for the greedy trajectory from simulated games.
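As a rough illustration of that structure, the sketch below merges value and reward errors into one term and then mixes the real and simulated trajectory losses with fixed weights. The squared-error form, the function names, and the weight values are assumptions made for this example; the source does not state the exact loss functions or weights used.

```python
def value_reward_loss(pred_values, target_values, pred_rewards, target_rewards):
    """One merged component: summed squared errors of values and of rewards (assumed form)."""
    value_term = sum((p - t) ** 2 for p, t in zip(pred_values, target_values))
    reward_term = sum((p - t) ** 2 for p, t in zip(pred_rewards, target_rewards))
    return value_term + reward_term


def combined_loss(real_loss, simulated_loss, w_real=1.0, w_sim=0.5):
    """Fixed-weight mix of the real-trajectory and simulated-trajectory losses.

    The default weights are placeholders, not values from the paper.
    """
    return w_real * real_loss + w_sim * simulated_loss


# Example: one merged loss for the real trajectory, one for the simulated one.
real = value_reward_loss([1.0, 2.0], [1.0, 0.0], [0.0], [1.0])
simulated = value_reward_loss([0.0], [1.0], [1.0], [0.0])
total = combined_loss(real, simulated)
```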

University of Michigan - Ann Arbor

Since MuZero uses a replay buffer, one can consider it off-policy. However, viewed from the perspective of the behavior and training policies, it could also be considered on-policy.

Off and On-Policy in MuZero

Borges, Alexandre and Arlindo L. Oliveira. “Combining Off and On-Policy Training in Model-Based Reinforcement Learning.” ArXiv abs/2102.12194 (2021): n. pag.
https://arxiv.org/pdf/2102.12194.pdf

Combining Off and On-Policy Training in Model-Based Reinforcement Learning

The paper reports experiments under three different conditions with different policies. In environments with sparse rewards, runs that use the off-policy value target with decay converge faster initially and achieve higher rewards than MuZero. In an environment with intermediate rewards, these runs also converge faster initially than MuZero, but stagnate at lower values. The combined algorithm converges faster than MuZero and no longer needs to assume that the environment is reversible.
