Concept
Off-Policy Policy Target
Only provided benefits in Cartpole. Runs had fast initial convergence, but quickly stagnated to values lower than MuZero, and thus it seemingly does not improve training and impairs convergence. More dependent on the number of simulations than the value target, however, based on h on the M0ALL training curve, the target seems to be useful at the beginning.
0
1
Updated 2021-08-19
Tags
Data Science