|
|
|
|
|
by loser777
2022 days ago
|
|
At a few hundred microseconds per step, the benchmark steps start to approach the overhead of GPU kernel invocation and memory allocation.
1875 training steps means a batch size of 32, which is hardly optimal given the extremely small model size here.
The same effect is the reason why CPUs remain latency competitive for batch size 1 inference, especially for small models. |
|