Concept

How can the on-policy and off-policy training strategies of MuZero be combined?

The authors merge the reward loss and the value loss into a single component, because the two are intertwined: the value is computed from the rewards. They then use hard (fixed) weights to combine two losses into one training objective: the loss on the real game trajectory and the loss on the greedy trajectory taken from simulated games.
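The weighted combination described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the names `TrajectoryLoss`, `merge_value_reward`, and `combined_loss` are hypothetical, merging the value and reward losses by simple addition is an assumption, and the 0.5 weights are placeholders rather than values taken from the source.

```python
from dataclasses import dataclass


@dataclass
class TrajectoryLoss:
    """Per-trajectory loss terms (hypothetical container for illustration)."""
    policy: float
    value: float
    reward: float


def merge_value_reward(loss: TrajectoryLoss) -> float:
    # Assumption: the merged component is the plain sum of the two terms.
    return loss.value + loss.reward


def trajectory_total(loss: TrajectoryLoss) -> float:
    # Policy term plus the single merged value/reward component.
    return loss.policy + merge_value_reward(loss)


def combined_loss(
    real: TrajectoryLoss,
    greedy: TrajectoryLoss,
    w_real: float = 0.5,    # placeholder fixed weight, not from the source
    w_greedy: float = 0.5,  # placeholder fixed weight, not from the source
) -> float:
    # Fixed-weight sum of the real-game loss and the simulated greedy loss.
    return w_real * trajectory_total(real) + w_greedy * trajectory_total(greedy)


real = TrajectoryLoss(policy=1.0, value=0.5, reward=0.25)
greedy = TrajectoryLoss(policy=0.5, value=0.25, reward=0.25)
total = combined_loss(real, greedy)
```

Because the weights are fixed rather than learned or scheduled, the balance between the real-trajectory loss and the simulated greedy-trajectory loss stays the same throughout training in this sketch.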


Updated 2021-08-19

Tags

Data Science