Hacker News new | ask | show | jobs
by loser777 2022 days ago
At a few hundred microseconds per step, the benchmark steps start to approach the overhead of GPU kernel invocation and memory allocation. 1875 training steps means a batch size of 32, which is hardly optimal given the extremely small model size here. The same effect is the reason why CPUs remain latency competitive for batch size 1 inference, especially for small models.