by gpm
1200 days ago
I don't know about this fork specifically, but in general, yes, absolutely. Even without enough RAM, you can stream the model weights from disk and run at [size of model / disk read speed] seconds per token. I'm doing that on a small GPU with the code linked below, and it should be easy to get it working with the CPU as compute instead. At least with my disk and CPU, I'm not even sure it would run any slower; I think disk reads would probably still be the bottleneck. Lacking an absurd number of CPUs just means it's slow, not impossible. https://github.com/gmorenz/llama/tree/ssd
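To put rough numbers on the [size of model / disk read speed] estimate, here is a minimal sketch. The function names are made up for illustration, and the inputs (65B parameters, 2 bytes per parameter, a 2 GB/s disk) are assumed values, not measurements from the linked code. It also assumes every weight is read from disk once per token and that nothing is cached in RAM.

```python
def model_size_bytes(n_params: float, bytes_per_param: float) -> float:
    """Total size of the weights on disk, in bytes."""
    return n_params * bytes_per_param


def seconds_per_token(model_bytes: float, disk_bytes_per_second: float) -> float:
    """Per-token latency when the disk is the bottleneck.

    Assumes the full set of weights is streamed from disk once per token.
    """
    if disk_bytes_per_second <= 0:
        raise ValueError("disk read speed must be positive")
    return model_bytes / disk_bytes_per_second


# Assumed, illustrative inputs (hypothetical hardware, not a benchmark):
size = model_size_bytes(n_params=65e9, bytes_per_param=2)  # 16-bit weights
estimate = seconds_per_token(size, disk_bytes_per_second=2e9)  # 2 GB/s disk

print(f"{size / 1e9:.0f} GB of weights, about {estimate:.0f} s per token")
```

With those assumed inputs the estimate comes out at about a minute per token, which is slow but not impossible. A faster disk or smaller weights scales the figure down proportionally.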
However, the 65B-parameter model is, according to the benchmarks, such a beast that you might be able to do some things with it that are not possible with ChatGPT (despite all of ChatGPT's quality-of-life features). Amazing times.