by VierScar
314 days ago
No, I don't think it's the bits. I would say it's the computation. Inference requires a lot of matmuls, and as the number of tokens goes up, the number of operations grows quadratically, O(n^2) at least. So increasing your context/conversation will quickly degrade performance. I seriously doubt that memory throughput is the bottleneck during inference here.
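To put rough numbers on the O(n^2) part, here is a minimal sketch that counts only the two attention matmuls for one layer when the whole sequence is processed at once. The function name `attention_matmul_flops` and the `d_model=4096` width are my own assumptions for illustration, not taken from any particular model, and the count ignores the projections, the MLP, and any KV caching.

```python
def attention_matmul_flops(n_tokens: int, d_model: int) -> int:
    """Rough FLOP count for the two attention matmuls in one layer.

    Hypothetical helper: counts a multiply-add as 2 FLOPs and sums over
    all heads, so the head count cancels out and only d_model matters.
    """
    # Q @ K^T: (n x d) times (d x n) gives an (n x n) score matrix.
    scores = 2 * n_tokens * n_tokens * d_model
    # softmax(scores) @ V: (n x n) times (n x d) gives an (n x d) output.
    weighted_sum = 2 * n_tokens * n_tokens * d_model
    return scores + weighted_sum


if __name__ == "__main__":
    # d_model=4096 is an assumed width, chosen only for illustration.
    for n in (1024, 2048, 4096):
        print(n, attention_matmul_flops(n, d_model=4096))
```

Doubling the token count quadruples this term, which is the quadratic growth I mean.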