


	by gliptic 1116 days ago
	> For CPU it is logical to expect 12x training speed improvement, but it is still too slow to be consistent practically useful.

	I don't see what you base this on. MeZO trades one back-propagation pass for another forward pass. Why would that be 12x faster? It's also clear the convergence rate is slower than plain SGD (never mind AdamW) by a factor proportional to the effective rank of the Hessian.
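
To make the "one backward pass traded for another forward pass" point concrete, here is a minimal sketch of a MeZO-style (SPSA) update next to a plain SGD update. It is not the MeZO authors' code: the toy least-squares `loss`, the function names `projected_grad`, `mezo_style_step` and `sgd_step`, and the hyperparameters `lr` and `eps` are all assumptions chosen for illustration, and the closed-form gradient stands in for back-propagation.

```python
import numpy as np


def loss(theta, X, y):
    """Toy least-squares loss; one call stands in for one forward pass."""
    r = X @ theta - y
    return 0.5 * float(np.mean(r ** 2))


def projected_grad(theta, X, y, z, eps=1e-3):
    """Central-difference estimate of z . grad(loss), using two forward passes."""
    loss_plus = loss(theta + eps * z, X, y)   # forward pass 1
    loss_minus = loss(theta - eps * z, X, y)  # forward pass 2
    return (loss_plus - loss_minus) / (2.0 * eps)


def mezo_style_step(theta, X, y, lr=1e-2, eps=1e-3, seed=0):
    """One zeroth-order step: no backward pass, only a scalar times a random direction."""
    # The direction z is drawn from a seeded generator (hypothetical setup),
    # so it could be regenerated from the seed instead of being stored.
    z = np.random.default_rng(seed).standard_normal(theta.shape)
    return theta - lr * projected_grad(theta, X, y, z, eps) * z


def sgd_step(theta, X, y, lr=1e-2):
    """One plain SGD step; the analytic gradient plays the role of backprop."""
    grad = X.T @ (X @ theta - y) / len(y)
    return theta - lr * grad
```

Per step, the zeroth-order update costs two forward passes, while SGD costs one forward and one backward pass. Under the common rule of thumb that a backward pass costs roughly twice a forward pass (an assumption here, not something stated in the thread), that is a modest per-step saving, nowhere near 12x. Each zeroth-order step also only moves along a single random direction `z`, which is where the slower convergence comes from.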