Hacker News
by sheikheddy 1365 days ago
> Intuitively, we only mask if the current value of the online network is outside of the trust region and the sign of the TD-error points away from the trust region.

Seems like this is where most of the improvement comes from. Anyone have an analogy to help explain why this works?
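
For anyone trying to picture the quoted rule, here is a rough sketch of how I read it, not the paper's actual code. It assumes the trust region is a fixed-radius interval around some anchor value (for example a target network's estimate) and that a positive TD-error pushes the online value up. The names `trust_region_mask`, `q_anchor`, and `radius` are made up for illustration.

```python
def trust_region_mask(q_online, q_anchor, td_error, radius):
    """Return 1.0 where the update is kept and 0.0 where it is masked.

    Assumptions (not from the quoted text): the trust region is the
    interval [q_anchor - radius, q_anchor + radius], and the update
    moves q_online in the direction of the sign of td_error.
    """
    mask = []
    for q, anchor, delta in zip(q_online, q_anchor, td_error):
        offset = q - anchor
        # Condition 1: the online value already sits outside the region.
        outside = abs(offset) > radius
        # Condition 2: the update has the same sign as the offset,
        # so it would push the value even further from the region.
        pushes_away = offset * delta > 0
        mask.append(0.0 if (outside and pushes_away) else 1.0)
    return mask


q_online = [1.0, 3.0, 3.0, -3.0]
q_anchor = [0.0, 0.0, 0.0, 0.0]
td_error = [2.0, 2.0, -2.0, -2.0]
mask = trust_region_mask(q_online, q_anchor, td_error, radius=1.5)
print(mask)  # [1.0, 0.0, 1.0, 0.0]
```

Under these assumptions, the third entry is kept even though the value is outside the region, because its update points back toward the anchor. Only steps that would make an out-of-range estimate drift further get dropped.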