by high_derivative 2424 days ago
Thank you for responding. My point is that the gradient of the likelihood ratio, in particular, is what trips people up. They ask questions like 'why is this ratio not always 1?'. This is why I would say it is critical to explain what goes where here, i.e. that we save the old logp_pi (even though we could recompute it) so that it is treated as a constant when computing the ratio and its gradient. That would be, from my perspective, the key pedagogical moment of a PPO tutorial. However, this is purely subjective, and I agree that one can feel differently about where to put explanations.
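To make the point concrete, here is a rough sketch in plain Python, not taken from the tutorial under discussion. It assumes a toy one-parameter Bernoulli policy; the names `logp`, `ratio_stored`, `ratio_recomputed` and `grad` are made up for illustration, and the gradient is a finite-difference estimate rather than autodiff. The stored log-prob is just a number, so the ratio equals 1 at the old parameters but its gradient does not vanish. If the old log-prob is recomputed from the same parameter, the ratio is identically 1 and the gradient is zero.

```python
import math

def logp(theta, action):
    # Toy policy (assumption): P(action = 1) = sigmoid(theta)
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return math.log(p1 if action == 1 else 1.0 - p1)

def ratio_stored(theta, action, logp_old):
    # logp_old was saved when the action was sampled; it is a plain
    # number and therefore a constant with respect to theta.
    return math.exp(logp(theta, action) - logp_old)

def ratio_recomputed(theta, action):
    # "Old" log-prob recomputed from the same theta: it moves with
    # theta, so the ratio is identically 1.
    return math.exp(logp(theta, action) - logp(theta, action))

def grad(f, x, eps=1e-6):
    # Central finite difference, standing in for autodiff.
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

theta0 = 0.3   # parameters that generated the data
action = 1     # the sampled action
logp_old = logp(theta0, action)

# Ratio at the old parameters: exactly 1 in both variants.
r = ratio_stored(theta0, action, logp_old)

# Gradient with the stored constant: equals d/dtheta logp, non-zero.
g_stored = grad(lambda t: ratio_stored(t, action, logp_old), theta0)

# Gradient when both terms depend on theta: zero.
g_recomputed = grad(lambda t: ratio_recomputed(t, action), theta0)

# After a parameter update the stored-constant ratio leaves 1.
r_after = ratio_stored(theta0 + 0.1, action, logp_old)

print(r, g_stored, g_recomputed, r_after)
```

In an autodiff framework the same effect would come from storing the value as data or applying a stop-gradient to it. The sketch only shows why the ratio is 1 on the first step yet still carries a useful gradient, and why it stops being 1 once the parameters have moved.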