Concept
Comparing Environments
The combination of on-policy and off-policy strategies is adjusted by adjusting the weights in their loss function. According to their experiments, they claim several points. First, the usage of the off-policy value target provides a clear benefit. Second, in an environment with intermediate rewards, the off-policy value target might not provide any new information. But they are leading the model to find and quickly overfit to a sub-optimal strategy. And as related weights decay, they will find optimal strategy finally and lead to a faster convergence.
0
1
Updated 2021-08-19
Tags
Data Science