by gpm 1200 days ago
I don't know about this fork specifically, but in general yes absolutely.

Even without enough RAM, you can stream model weights from disk and run at [size of model / disk read speed] seconds per token.
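To put a rough number on that estimate, here is a minimal sketch. The function name and the example figures (65B parameters stored as 2-byte fp16 values, a 2 GB/s disk) are assumptions for illustration, not measurements, and the result is only a lower bound since it ignores compute and caching.

```python
def seconds_per_token(model_bytes: float, disk_bytes_per_sec: float) -> float:
    """Estimate time per token when every weight is re-read from disk once per token."""
    if disk_bytes_per_sec <= 0:
        raise ValueError("disk read speed must be positive")
    return model_bytes / disk_bytes_per_sec


# Hypothetical numbers: 65e9 parameters at 2 bytes each (fp16), 2e9 bytes/s disk read.
model_bytes = 65e9 * 2
print(seconds_per_token(model_bytes, 2e9))  # 65.0
```

So under those assumed numbers you would wait about a minute per token, which is slow but workable for short outputs.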

I'm doing that on a small GPU with this code, but it should be easy to get it working with the CPU as compute instead. At least with my disk and CPU, I'm not sure it would even run slower; I think disk read would probably still be the bottleneck.

Not having an absurd number of CPUs just means it's slow, not impossible.

https://github.com/gmorenz/llama/tree/ssd
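For anyone curious what "stream weights from disk" means in practice, here is a toy CPU-only sketch of the idea. It is not the code in the linked branch. The file layout (one row-major float64 file per layer), the helper names, and the plain matrix-vector "layer" are all assumptions made up for illustration. The point is only that one layer is resident in memory at a time.

```python
import os
import tempfile
from array import array


def save_layer(path, matrix):
    """Write a square matrix to disk as row-major float64 (hypothetical layout)."""
    flat = array("d", (x for row in matrix for x in row))
    with open(path, "wb") as f:
        flat.tofile(f)


def load_layer(path, n):
    """Read one n x n layer back from disk."""
    flat = array("d")
    with open(path, "rb") as f:
        flat.fromfile(f, n * n)
    return [flat[i * n:(i + 1) * n] for i in range(n)]


def forward_streaming(layer_paths, x):
    """Apply each layer in order, holding only one layer in memory at a time."""
    n = len(x)
    for path in layer_paths:
        weights = load_layer(path, n)  # disk read: the expected bottleneck
        x = [sum(w * v for w, v in zip(row, x)) for row in weights]
        del weights  # drop this layer before loading the next one
    return x


def demo():
    """Two tiny 2x2 'layers' written to a temp dir, then streamed back."""
    with tempfile.TemporaryDirectory() as d:
        layers = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
        paths = []
        for i, m in enumerate(layers):
            p = os.path.join(d, f"layer{i}.bin")
            save_layer(p, m)
            paths.append(p)
        return forward_streaming(paths, [1.0, 1.0])


print(demo())  # [7.0, 3.0]
```

A real model would do this per transformer block and per generated token, which is why total time per token ends up close to model size divided by read speed.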

1 comment

Yeah, I find this area fascinating. It's very cool to run a 7B-parameter model locally, but it must feel like a toy compared to ChatGPT, for example.

However, the 65B-parameter model is, according to the benchmarks, such a beast that you might be able to do some things with it that are not possible with ChatGPT (despite all of ChatGPT's quality-of-life features). Amazing times.