Concept

Off-Policy Policy Target δ\delta

Only provided benefits in Cartpole. Runs had fast initial convergence, but quickly stagnated to values lower than MuZero, and thus it seemingly does not improve training and impairs convergence. More dependent on the number of simulations than the value target, however, based on h on the M0ALL training curve, the target seems to be useful at the beginning.

0

1

Updated 2021-08-19

Tags

Data Science