by sheikheddy
1365 days ago
> Intuitively, we only mask if the current value of the online network is outside of the trust region and the sign of the TD-error points away from the trust region.

Seems like this is where most of the improvement comes from. Anyone have an analogy to help explain why this works?
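
For anyone who wants to poke at the quoted rule, here is a minimal sketch of how I read it. This is not the paper's code. The names `keep_update`, `q_anchor` (the value at the center of the trust region), `td_target`, and `radius` are all my own assumptions, and I am taking "TD-error" to mean `td_target - q_online`, so a positive error pushes the online value up.

```python
def keep_update(q_online, q_anchor, td_target, radius):
    """Return True if the update should be applied, False if it is masked.

    Hypothetical helper: every name and the TD-error sign convention
    are assumptions for illustration, not taken from the paper.
    """
    # Signed offset of the online value from the trust-region center.
    diff = q_online - q_anchor
    # Direction the update would move the online value (assumed convention).
    td_error = td_target - q_online

    outside = abs(diff) > radius
    # Same sign means the update pushes the value further from the center.
    pointing_away = diff * td_error > 0

    # Mask only when both conditions from the quote hold.
    return not (outside and pointing_away)


# Trust region is [-2, 2] around an anchor value of 0.
examples = [
    (1.0, 5.0),    # inside the region: always kept
    (3.0, 5.0),    # outside, pushed further out: masked
    (3.0, 1.0),    # outside, pulled back in: kept
    (-3.0, -1.0),  # outside on the low side, pulled back in: kept
]
results = [keep_update(q, 0.0, target, 2.0) for q, target in examples]
print(results)  # [True, False, True, True]
```

Under these assumptions the rule never blocks an update that moves the value back toward the region; it only drops the ones that would carry it further out.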