I don't know about this fork specifically, but in general yes absolutely.
Even without enough RAM, you can stream model weights from disk and run at roughly [size of model / disk read speed] seconds per token, since every weight has to be read once for each token.
I'm doing that on a small GPU with this code, but it should be easy to get this working with the CPU as compute instead. At least with my disk/CPU, I'm not even sure it would run any slower; I think disk read would probably still be the bottleneck.
A lack of an absurd number of CPUs just means it's slow, not impossible.
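To put rough numbers on the [size of model / disk read speed] estimate above, here is a minimal sketch. The 7B parameter count at 2 bytes per weight and the 2 GB/s disk are made-up example figures, not measurements from this thread, and the function name is just for illustration.

```python
def seconds_per_token(model_bytes: float, disk_bytes_per_sec: float) -> float:
    """Estimate time per token when every weight is read from disk once per token.

    This ignores compute time and any caching, so treat it as a rough floor.
    """
    if disk_bytes_per_sec <= 0:
        raise ValueError("disk read speed must be positive")
    return model_bytes / disk_bytes_per_sec


# Assumed example figures: 7B parameters, 2 bytes each (fp16), 2 GB/s disk.
model_bytes = 7e9 * 2
disk_speed = 2e9
print(seconds_per_token(model_bytes, disk_speed))  # 7.0 seconds per token
```

With those assumed figures you get about 7 seconds per token, which is why the disk rather than the CPU or GPU is likely the bottleneck.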
Yeah, I find this area fascinating. Like, it's very cool to run a 7B params model locally, but it must feel like a toy when compared to ChatGPT, for example.
However, the 65B parameter model, according to the benchmarks, is such a beast that you might be able to do some things with it that are not possible with ChatGPT (despite all of ChatGPT's quality of life features). Amazing times.
You don't need 256 GB. A pair of the new 48GB DDR5 sticks along with a pair of 32GB sticks should work in a consumer DDR5 motherboard and fit the weights. Memory use does burst during the initial load, so a fast disk with swap about the same size as RAM seems necessary. It took about 25 minutes to generate a single 500-character response on a 5800X with 32 GB of DDR4, but I was not able to get it to run on more than one thread with the 7B model.
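A quick back-of-the-envelope check of the RAM claim above, as a sketch. It assumes the weights are stored at 2 bytes per parameter (fp16), which the comment does not state, and it ignores the OS, activations, and the loading burst mentioned above.

```python
def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Size of the raw weights in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Two 48GB sticks plus two 32GB sticks, as suggested above.
ram_gb = 2 * 48 + 2 * 32

# Assumption: 65B parameters stored as fp16 (2 bytes each).
needed_gb = weights_gb(65e9, 2)

print(ram_gb, needed_gb, needed_gb <= ram_gb)  # 160 130.0 True
```

So under that assumption there is about 30 GB of headroom for everything else.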
None of the current Ryzen CPUs work with 48GB DDR5 sticks, right?
That means if you want to go beyond 128GB you can get an old X399 board (there are some reports of people getting 256GB to work) or more recent Threadripper boards.
I tried mark's OMP_NUM_THREADS suggestion (https://news.ycombinator.com/item?id=35018559) and did not see any obvious change that made it run in parallel. The huggingface patch (https://github.com/huggingface/transformers/pull/21955), once it gets in, is supposed to allow streaming from RAM to the GPU. So for me it was not worth the effort to keep working on the CPU version, as even a best-case ~30x speedup would still take around a minute to run the 7B.
I wonder if we will start to see complex prune functions and tools pop up.
So before you start a task, you sort of describe the domain, and the model is split into the third that is most useful and relevant to that topic/query and the two-thirds most distant from that realm. Then either just that one third is used in a detached fashion, or the two parts work as two layers of cache, one in RAM and one on disk.
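The "two layers of cache" half of that idea could look something like the sketch below. To be clear about what is assumed: this uses plain least-recently-used eviction rather than any domain-aware selection, the class name and the load_from_disk callback are hypothetical, and the "disk" in the demo is just a dict standing in for real weight files.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class TwoTierCache:
    """Keep up to `capacity` weight blocks in RAM; fetch the rest from disk on demand."""

    def __init__(self, load_from_disk: Callable[[Hashable], Any], capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._load = load_from_disk
        self._capacity = capacity
        self._ram: OrderedDict = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Any:
        if key in self._ram:
            # RAM hit: mark the block as most recently used.
            self._ram.move_to_end(key)
            self.hits += 1
            return self._ram[key]
        # RAM miss: read from the slow tier, then evict the oldest block if full.
        self.misses += 1
        value = self._load(key)
        self._ram[key] = value
        if len(self._ram) > self._capacity:
            self._ram.popitem(last=False)
        return value


# Demo with a dict standing in for on-disk weight blocks (hypothetical names).
fake_disk = {"layer0": [0.1], "layer1": [0.2], "layer2": [0.3]}
cache = TwoTierCache(fake_disk.__getitem__, capacity=2)
for name in ["layer0", "layer1", "layer0", "layer2", "layer1"]:
    cache.get(name)
print(cache.hits, cache.misses)  # 1 4
```

A domain-aware version would replace the LRU policy with some score of how relevant each block is to the described task, which is the part nobody has a ready-made tool for yet.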
https://github.com/gmorenz/llama/tree/ssd