by stevenhuang
382 days ago
During inference, each token passes through every parameter of the model as matrix-vector products. Then, as context grows, each new token also passes through all current context tokens as matrix-vector products. This means memory bandwidth requirements grow as context sizes grow. For datacenter workloads, batching can be used to make efficient use of this memory bandwidth and make things compute-bound instead.
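To put rough numbers on that, here is a minimal back-of-the-envelope sketch in Python. It assumes decoding is purely memory-bandwidth bound, that the weights are read once per decoding step and shared across the batch, and that each sequence reads its own cached context. The function names, the per-token cache size, and the bandwidth figure are all hypothetical, not measurements of any real GPU.

```python
def bytes_read_per_step(weight_bytes, kv_bytes_per_token, context_len, batch_size):
    """Approximate bytes read from device memory for one decoding step.

    Assumption: weights are read once and reused for every sequence in the
    batch, while each sequence reads its own cached context.
    """
    return weight_bytes + batch_size * kv_bytes_per_token * context_len


def tokens_per_second(weight_bytes, kv_bytes_per_token, context_len,
                      batch_size, bandwidth_bytes_per_s):
    """Upper bound on total tokens/s if memory bandwidth is the only limit."""
    step_bytes = bytes_read_per_step(
        weight_bytes, kv_bytes_per_token, context_len, batch_size
    )
    step_seconds = step_bytes / bandwidth_bytes_per_s
    return batch_size / step_seconds  # one new token per sequence per step


# Hypothetical numbers: 12 GB of weights, 360 GB/s of memory bandwidth,
# 100 KB of cached context per token.
WEIGHTS = 12e9
BANDWIDTH = 360e9
KV_PER_TOKEN = 100e3

for batch in (1, 8, 64):
    for context in (0, 8_000, 64_000):
        rate = tokens_per_second(WEIGHTS, KV_PER_TOKEN, context, batch, BANDWIDTH)
        print(f"batch={batch:3d} context={context:6d} -> {rate:8.1f} tokens/s")
```

Under these assumptions, throughput drops as the context grows and rises with batch size, because the cost of reading the weights is spread over more sequences. The model ignores compute entirely, so it cannot show the point at which a large batch becomes compute-bound.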
|
It seems to me that even if you pass in a long context on every prompt, the data transferred for that context is still tiny relative to the time spent executing on the processor/GPU/tensor core/etc.
Let's say I load up a 12 GB model on my 12 GB VRAM GPU. I pass in a prompt with 1 MB of context, which produces a 500 KB response after 1 s. That's still only 1.5 MB of IO transferred in 1 s, which kept the GPU busy for 1 s. Increasing the prompt size is going to increase the time to a response accordingly.
Unless the GPU is not fully utilised on each prompt-response cycle, I feel that the GPU is still the bottleneck here, not the bus performance.
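For scale, the figures in this example can be set next to the on-device memory traffic implied by the claim that each token passes through every parameter. This is a sketch only: the token count of 100 is a hypothetical placeholder (the example does not say how many tokens the response contains), and it assumes an unbatched run in which the full 12 GB of weights is read from VRAM once per generated token.

```python
def host_io_rate(prompt_bytes, response_bytes, seconds):
    """Bytes per second moved between host and GPU for the prompt and response."""
    return (prompt_bytes + response_bytes) / seconds


def device_weight_traffic(weight_bytes, tokens_generated):
    """Bytes read from VRAM if all weights are read once per generated token."""
    return weight_bytes * tokens_generated


# Figures from the example: 1 MB prompt, 500 KB response, 1 s.
io_bytes_per_s = host_io_rate(1_000_000, 500_000, 1.0)

# Hypothetical: 100 tokens generated in that second, 12 GB of weights.
vram_bytes = device_weight_traffic(12 * 10**9, 100)

print(f"host <-> GPU IO:      {io_bytes_per_s / 1e6:.1f} MB/s")
print(f"VRAM weight traffic:  {vram_bytes / 1e12:.1f} TB in the same second")
print(f"ratio:                {vram_bytes / io_bytes_per_s:,.0f}x")
```

The two numbers measure different paths: the first is traffic over the bus between host and GPU, the second is traffic between VRAM and the GPU's compute units. Which one the word "bandwidth" refers to changes the conclusion.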