Learn Before
Concept
Experimental Evaluation of Combining On- and Off-Policy Strategies
This paper proposes experiments under three different conditions with different policies. In environments with sparse rewards, runs that use the off-policy value target with decay converge faster initially and achieve higher rewards than MuZero. In an environment with intermediate rewards, these runs also converge faster initially than MuZero, but stagnate at lower values. The combined algorithm converges faster than MuZero and no longer needs to assume that the environment is reversible.
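The "off-policy value target with decay" can be sketched as a mixture of an on-policy target and an off-policy target whose weight shrinks over training. This is a minimal illustrative sketch, not the paper's exact formulation; the function name, parameters, and the exponential decay schedule are assumptions.

```python
def decayed_value_target(on_policy_target: float,
                         off_policy_target: float,
                         step: int,
                         beta0: float = 1.0,
                         decay: float = 0.999) -> float:
    """Blend two value targets; the off-policy weight decays with training step.

    Hypothetical schedule: beta = beta0 * decay**step, so early training
    leans on the off-policy target (faster initial convergence) and later
    training relies on the on-policy target.
    """
    beta = beta0 * decay ** step
    return (1.0 - beta) * on_policy_target + beta * off_policy_target
```

Early in training (small `step`) the returned target is close to the off-policy estimate; as training proceeds it smoothly shifts toward the on-policy target.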
Updated 2021-08-19
Tags
Data Science