Concept
Off-Policy Policy Target
In model-based reinforcement learning, incorporating an off-policy policy target () into algorithms like MuZero primarily provides benefits in specific environments such as Cartpole. While runs using this policy target demonstrate fast initial convergence, they typically stagnate at values lower than the standard MuZero baseline, ultimately impairing long-term convergence. The effectiveness of the off-policy policy target is highly dependent on the number of simulations, but it may offer utility in the very early stages of training.
0
1
Updated 2026-06-15
Contributors are:
Who are from:
Tags
Data Science