Hacker News
by quadrature 251 days ago
I'm not very well versed, but I believe training requires more memory because the intermediate computations (the activations from the forward pass) have to be stored so that you can calculate gradients for each layer during the backward pass.
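
To make that concrete, here is a rough sketch in plain Python, not how any real framework does it. The toy network (a chain of scalar layers, each computing tanh(w * h)) and the names `forward`, `backward`, and `keep_activations` are all made up for illustration. The point is only that the inference pass can throw each intermediate value away as it goes, while the training pass has to hold on to one entry per layer until the backward pass has used it.

```python
import math

def forward(x, weights, keep_activations):
    """Run a chain of scalar layers: h = tanh(w * h).

    Returns (output, saved). `saved` holds each layer's (input, output)
    pair only when keep_activations is True (the training case).
    """
    saved = []
    h = x
    for w in weights:
        out = math.tanh(w * h)
        if keep_activations:
            saved.append((h, out))  # needed later for this layer's gradient
        h = out  # without saving, the previous value is simply dropped
    return h, saved

def backward(weights, saved):
    """Compute d(output)/d(w_i) for every layer from the saved activations."""
    grads = [0.0] * len(weights)
    upstream = 1.0  # d(output)/d(output)
    for i in reversed(range(len(weights))):
        h_in, out = saved[i]
        local = 1.0 - out * out             # derivative of tanh at this layer
        grads[i] = upstream * local * h_in  # gradient w.r.t. this layer's weight
        upstream = upstream * local * weights[i]  # pass gradient to the layer below
    return grads

weights = [0.5, -1.2, 0.8, 1.5]  # arbitrary illustrative values

y_inf, saved_inf = forward(1.0, weights, keep_activations=False)
y_train, saved_train = forward(1.0, weights, keep_activations=True)
grads = backward(weights, saved_train)

# Number of stored (input, output) pairs: inference vs. training
print(len(saved_inf), len(saved_train))  # prints "0 4"
```

The stored list grows with the number of layers (and, in a real network, with batch size and layer width), which is where the extra training memory comes from in this simplified picture.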