Learn Before
Concept

Experimental Evaluation of Combining On- and Off-Policy Strategies

This paper reports experiments under three different conditions with different policies. In environments with sparse rewards, runs that use the off-policy value target with decay converge faster initially and achieve higher rewards than MuZero. In an environment with intermediate rewards, these runs also converge faster initially than MuZero but stagnate at lower final values. The combined algorithm converges faster than MuZero and no longer needs to assume that the environment is reversible.
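The "off-policy value target with decay" can be pictured as a convex blend of an off-policy value estimate and an on-policy return, where the off-policy weight shrinks over training. The sketch below is a minimal illustration under that assumption; the function and parameter names (`decayed_value_target`, `decay_rate`) are hypothetical, not from the paper.

```python
import math

# Hypothetical sketch: blend an off-policy (bootstrapped) value estimate
# with the on-policy return, with the off-policy weight decaying
# exponentially as training progresses. Names are illustrative.
def decayed_value_target(off_policy_value: float,
                         on_policy_return: float,
                         step: int,
                         decay_rate: float = 1e-4) -> float:
    w = math.exp(-decay_rate * step)  # off-policy weight, shrinks with step
    return w * off_policy_value + (1.0 - w) * on_policy_return
```

Early in training the target leans on the off-policy estimate (helping initial convergence, as in the sparse-reward results), while later it relies mostly on on-policy returns.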


Updated 2021-08-19

Tags

Data Science