by Tuna-Fish
87 days ago
On some workloads, swapping is a bad idea. The fundamental problem here is that the workload of LLMs is (vastly simplified) a repeated linear read of all the weights, in order. That is, there is no temporal locality in the memory accesses. There is literally anti-locality: when you read a set of weights, you know you will not need them again until you have processed everything else. This means that many of the old approaches don't work, because temporal locality is such a core assumption underlying all of them. The best you can do is really a very large pool of very fast RAM. In the long term, compute is probably going to move towards the memory.
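To make the anti-locality point concrete, here is a rough sketch of an LRU cache (standing in for any recency-based swapping or caching policy) under a repeated in-order scan. The function name `lru_hit_rate` and the block counts are made up for illustration, and real LLM inference is much messier than a pure cyclic scan, so treat this as a toy model only.

```python
from collections import OrderedDict


def lru_hit_rate(num_blocks, cache_blocks, passes):
    """Hit rate of an LRU cache over repeated in-order scans of num_blocks blocks.

    Toy model: each "block" stands for a chunk of weights, and each pass
    reads blocks 0..num_blocks-1 in order, as in the simplified LLM workload.
    """
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    accesses = 0
    for _ in range(passes):
        for block in range(num_blocks):
            accesses += 1
            if block in cache:
                hits += 1
                cache.move_to_end(block)  # mark as most recently used
            elif cache_blocks > 0:
                if len(cache) >= cache_blocks:
                    cache.popitem(last=False)  # evict least recently used
                cache[block] = True
    return hits / accesses


if __name__ == "__main__":
    for size in (50, 99, 100):
        print(size, lru_hit_rate(num_blocks=100, cache_blocks=size, passes=10))
```

In this model the cache gets no hits at all until it can hold every block: with 100 blocks and room for 99, LRU always evicts exactly the block that will be needed next, so the hit rate stays at zero. Only when all 100 blocks fit does everything after the first pass hit.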