Concept

Off-Policy Policy Target δ\delta

In model-based reinforcement learning, incorporating an off-policy policy target (δ\delta) into algorithms like MuZero primarily provides benefits in specific environments such as Cartpole. While runs using this policy target demonstrate fast initial convergence, they typically stagnate at values lower than the standard MuZero baseline, ultimately impairing long-term convergence. The effectiveness of the off-policy policy target is highly dependent on the number of simulations, but it may offer utility in the very early stages of training.

0

1

Updated 2026-06-15

Tags

Data Science