by ein0p
437 days ago
Several incorrect assumptions in this take. For one thing, 16-bit precision is not necessary. For another, 140GB/token holds only if your batch size is 1 and your sequence length is 1 (no speculative decoding). Nobody runs LLMs like that on those GPUs: if you do, compute utilization becomes ridiculously low. With a batch size greater than 1 and speculative decoding, the arithmetic intensity of the kernels is much higher, and having weights "off chip" is not that much of a concern.
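
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It is not a benchmark. It assumes a 70B-parameter dense model (which is what 140GB at 16 bits implies), about 2 FLOPs per parameter per token, and weights read from memory once per forward pass. It ignores KV-cache and activation traffic. The function names and the `tokens_per_step` parameter (tokens verified per sequence in one pass under speculative decoding) are made up for illustration.

```python
def weight_bytes_per_token(n_params, bytes_per_param, batch_size=1, tokens_per_step=1):
    """Weight bytes read from memory, amortized over every token in one forward pass.

    Assumption: each weight is read once per pass regardless of how many
    tokens (batch_size * tokens_per_step) that pass processes.
    """
    return n_params * bytes_per_param / (batch_size * tokens_per_step)


def arithmetic_intensity(bytes_per_param, batch_size=1, tokens_per_step=1):
    """Approximate FLOPs per weight byte read.

    Assumption: ~2 FLOPs (one multiply, one add) per parameter per token,
    so the parameter count cancels out of the ratio.
    """
    return 2.0 * batch_size * tokens_per_step / bytes_per_param


if __name__ == "__main__":
    n_params = 70e9  # assumed model size, implied by 140GB at 2 bytes/param

    # (label, bytes per parameter, batch size, tokens verified per step)
    scenarios = [
        ("16-bit, batch 1, no speculation", 2, 1, 1),
        ("8-bit, batch 1, no speculation", 1, 1, 1),
        ("8-bit, batch 32, 4 tokens/step", 1, 32, 4),
    ]
    for label, bpp, batch, spec in scenarios:
        gb = weight_bytes_per_token(n_params, bpp, batch, spec) / 1e9
        ai = arithmetic_intensity(bpp, batch, spec)
        print(f"{label}: {gb:.3f} GB of weights per token, {ai:.0f} FLOPs per byte")
```

The 140GB/token figure only shows up in the first scenario. Under these assumptions, both the amortized weight traffic per token and the FLOPs-per-byte ratio scale linearly with batch size times tokens per step.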