by rfoo 483 days ago
Do they even have an optimized backward pass? It looks like optimizations like this aren't needed during training. Their V2 paper suggests as much.