by gliptic
1116 days ago
> For CPU it is logical to expect 12x training speed improvement, but it is still too slow to be consistent practically useful.

I don't see what you base this on. MeZO trades one back-propagation pass for another forward pass. Why would that be 12x faster?
It's also clear the convergence rate is slower than plain SGD (never mind AdamW) by a factor proportional to the effective rank of the Hessian.
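
To make the cost argument concrete, here is a rough sketch of a MeZO-style (SPSA) update. It assumes a toy quadratic `loss` standing in for a real model's forward pass; `mezo_step`, `lr`, `eps` and `seed` are names I made up for illustration, not the paper's code.

```python
import random

def loss(theta):
    # Toy quadratic standing in for a model's forward pass (hypothetical).
    return sum((t - 1.0) ** 2 for t in theta)

def mezo_step(theta, lr=0.05, eps=1e-3, seed=0):
    # Regenerate the perturbation z from a seed instead of storing it.
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = loss([t + eps * zi for t, zi in zip(theta, z)])   # forward pass 1
    minus = loss([t - eps * zi for t, zi in zip(theta, z)])  # forward pass 2
    g = (plus - minus) / (2 * eps)  # scalar estimate of the gradient along z
    # Step along z only; no backward pass is ever run.
    return [t - lr * g * zi for t, zi in zip(theta, z)]

theta = [0.0] * 4
for step in range(500):
    theta = mezo_step(theta, seed=step)
final_loss = loss(theta)
```

Each step costs two forward evaluations. If you take the usual rule of thumb that a backward pass costs about twice a forward pass (an assumption here, not a measurement), that is roughly 2 units of compute per step against 3 for forward plus backward, which is nowhere near 12x. Each step also only moves along one random direction, which is where the slower convergence comes from.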