Hacker News
by highfrequency 1955 days ago
Is this the first time that the gradient clip threshold has been chosen relative to the size of the weight matrix?
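
To make the question concrete, here is a rough sketch of clipping where the threshold scales with the weights instead of being a fixed constant. It assumes "size" means the Frobenius norm of the weight matrix, which is one reading of the question and may not match the linked work exactly. The names `frobenius_norm`, `clip_relative_to_weights`, `ratio`, and `eps` are hypothetical and chosen only for illustration.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def clip_relative_to_weights(grad, weights, ratio=0.01, eps=1e-3):
    """Rescale grad so that ||grad|| <= ratio * max(||weights||, eps).

    ratio and eps are illustrative defaults (assumptions), not values
    taken from any particular paper or library.
    """
    # The threshold depends on the weight norm, not on a fixed constant.
    max_norm = ratio * max(frobenius_norm(weights), eps)
    grad_norm = frobenius_norm(grad)
    if grad_norm <= max_norm:
        # Already within the allowed ratio: return an unchanged copy.
        return [row[:] for row in grad]
    scale = max_norm / grad_norm
    return [[x * scale for x in row] for row in grad]
```

With a fixed threshold, the same gradient would be clipped identically for small and large weight matrices. Here a layer with larger weights tolerates a proportionally larger gradient.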