by kcorbitt
380 days ago
It seems like the speedups here are most useful for small models, since on larger models a smaller fraction of the total time would be spent swapping between kernels? Would be interesting to see at least theoretical results for LLMs in the 14-70B parameter range, which is what most folks deploy in practice. And of course the effect on throughput at larger batch sizes, which they allude to at the end. Overall a very interesting result!