by gliptic 1116 days ago
> For CPU it is logical to expect 12x training speed improvement, but it is still too slow to be consistent practically useful.

I don't see what you base this on. MeZO trades one back-propagation pass for another forward pass. Why would that be 12x faster? It's also clear the convergence rate is slower than plain SGD (never mind AdamW) by a factor proportional to the effective rank of the Hessian.
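To make the cost argument concrete, here is a rough sketch of a MeZO-style step in plain NumPy. It is a toy illustration, not the paper's implementation. The names `mezo_step`, `quadratic_loss` and `run_mezo` are made up for this example, and the perturbation `z` is materialized in full rather than regenerated from the seed in place, which is what a memory-efficient version would do.

```python
import numpy as np

def mezo_step(theta, loss_fn, lr=1e-2, eps=1e-3, seed=0):
    """One zeroth-order (SPSA-style) step: two forward passes, no backward pass."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)      # random perturbation direction
    loss_plus = loss_fn(theta + eps * z)      # forward pass 1
    loss_minus = loss_fn(theta - eps * z)     # forward pass 2
    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2 * eps)
    # SGD-style update using the rank-one gradient estimate projected_grad * z.
    return theta - lr * projected_grad * z

def quadratic_loss(theta):
    """Hypothetical stand-in for a model's loss: 0.5 * ||theta||^2."""
    return 0.5 * float(np.sum(theta ** 2))

def run_mezo(dim=10, steps=2000, lr=1e-2):
    """Run the toy optimizer from theta = ones and return the final loss."""
    theta = np.ones(dim)
    for step in range(steps):
        theta = mezo_step(theta, quadratic_loss, lr=lr, seed=step)
    return quadratic_loss(theta)

if __name__ == "__main__":
    print(run_mezo())
```

Each step calls `loss_fn` twice, versus one forward plus one backward pass for SGD. If a backward pass costs on the order of one to two forward passes (a common rule of thumb, assumed here), the per-step saving is nowhere near 12x. And because each step only uses gradient information along one random direction, it typically needs more steps than SGD to make the same progress.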