| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by loser777 2022 days ago
	At a few hundred microseconds per step, the benchmark steps start to approach the overhead of GPU kernel invocation and memory allocation. 1875 training steps means a batch size of 32, which is hardly optimal given the extremely small model size here. The same effect is the reason why CPUs remain latency competitive for batch size 1 inference, especially for small models.