by vvladymyrov
1116 days ago
Inference can be done on CPU+RAM, but it is much slower (on the order of tens of seconds per token). So reducing the amount of memory the model uses during training would also reduce the number of compute operations, which could potentially make CPU+RAM more suitable for fine-tuning within a reasonable amount of time too.
Basically, a 12x lower GPU memory requirement also translates to 12x fewer compute operations (doing compute on a CPU allows less parallelism than on a GPU). The paper doesn't focus on CPU or GPU training-time improvements; I'd assume there is no significant improvement in the GPU training case. For CPU it is logical to expect a 12x training speed improvement, but that would still be too slow to be consistently useful in practice.
|
I don't see what you base this on. MeZO trades one back-propagation pass for another forward pass. Why would that be 12x faster? It's also clear the convergence rate is slower than plain SGD (never mind AdamW) by a factor proportional to the effective rank of the Hessian.
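
To make the "one more forward pass instead of back-propagation" point concrete, here is a rough sketch of a MeZO-style zeroth-order step. It is not the paper's code: the toy quadratic `loss`, the helper names (`perturb`, `mezo_step`, `run`), and the hyperparameters are all made up for illustration. Each step does two loss evaluations and no backward pass. The memory saving comes from regenerating the random direction from a seed rather than storing it, while the compute per step stays at roughly two forward passes.

```python
import random


def loss(theta):
    # Toy quadratic objective, a hypothetical stand-in for a model's forward pass.
    return sum((t - 3.0) ** 2 for t in theta)


def perturb(theta, seed, scale):
    # Regenerate the same Gaussian direction z from the seed and add
    # scale * z in place, so z never has to be kept in memory.
    rng = random.Random(seed)
    for i in range(len(theta)):
        theta[i] += scale * rng.gauss(0.0, 1.0)


def mezo_step(theta, loss_fn, lr, eps, seed):
    perturb(theta, seed, +eps)          # theta + eps * z
    loss_plus = loss_fn(theta)          # forward pass 1
    perturb(theta, seed, -2.0 * eps)    # theta - eps * z
    loss_minus = loss_fn(theta)         # forward pass 2
    perturb(theta, seed, +eps)          # back to the original theta
    # Scalar estimate of the directional derivative along z.
    g = (loss_plus - loss_minus) / (2.0 * eps)
    perturb(theta, seed, -lr * g)       # theta -= lr * g * z
    return g


def run(steps=2000, dim=10, lr=0.01, eps=1e-3, seed0=0):
    theta = [0.0] * dim
    start = loss(theta)
    for t in range(steps):
        mezo_step(theta, loss, lr, eps, seed0 + t)
    return start, loss(theta)


if __name__ == "__main__":
    start, end = run()
    print(f"loss before: {start:.3f}, loss after: {end:.3e}")
```

Note that every update moves along a single random direction, so it carries far less information than a full gradient. That is consistent with the slower convergence mentioned above, and nothing in the step itself suggests a 12x reduction in compute.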