	by haolez 1204 days ago
	Would it be possible to run the 65B one like this as well? Is the bottleneck just the RAM, or would I need an absurd number of CPUs as well? It's not that hard to create a consumer-grade desktop with 256GB in 2023.

3 comments

gpm 1204 days ago

I don't know about this fork specifically, but in general yes absolutely.

Even without enough RAM, you can stream model weights from disk and run at [size of model/disk read speed] seconds per token.

I'm doing that on a small GPU with this code, but it should be easy to get this working with the CPU as compute instead (and at least with my disk/CPU, I'm not even sure it would run any slower; I think disk read would probably still be the bottleneck).

A lack of an absurd number of CPUs just means it's slow, not impossible.

https://github.com/gmorenz/llama/tree/ssd
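
A rough sketch of the estimate above (seconds per token = size of model / disk read speed). The function name and the example figures (65B parameters at 2 bytes each, a 3 GB/s SSD) are illustrative assumptions, not measurements from this thread or from the linked code:

```python
def seconds_per_token(model_bytes: float, disk_bytes_per_sec: float) -> float:
    """Estimate time per token when all weights are re-read from disk for each token.

    This models only disk throughput; compute time and caching are ignored.
    """
    if disk_bytes_per_sec <= 0:
        raise ValueError("disk read speed must be positive")
    return model_bytes / disk_bytes_per_sec


# Assumed figures (hypothetical): 65e9 parameters stored as 2-byte fp16 values,
# read from a disk sustaining 3e9 bytes per second.
model_bytes = 65e9 * 2
disk_speed = 3e9
print(f"{seconds_per_token(model_bytes, disk_speed):.1f} s/token")
```

Under these assumed numbers the disk alone costs tens of seconds per token, which is why a faster CPU would not necessarily help if disk read is the bottleneck.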


haolez 1204 days ago

Yeah, I find this area fascinating. Like, it's very cool to run a 7B params model locally, but it must feel like a toy when compared to ChatGPT, for example.

However, the 65B-parameter model, according to the benchmarks, is such a beast that you might be able to do some things with it that are not possible with ChatGPT (despite all of ChatGPT's quality-of-life features). Amazing times.


downvotetruth 1204 days ago

You don't need 256 GB. A pair of the new 48GB DDR5 sticks along with a pair of 32GB sticks should work in a consumer DDR5 motherboard to fit the weights. Memory use does burst when initially loading, so a fast disk with about the same swap size as RAM seems necessary. It took about 25 minutes to generate a single 500-character response using a 5800X and 32 GB DDR4, but I was not able to get it to run on more than 1 thread with the 7B model.
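
The capacity arithmetic behind that claim can be checked with a quick sketch. It assumes the weights are stored at 2 bytes per parameter (fp16) and uses decimal gigabytes; both are assumptions, and the helper names are hypothetical. It checks capacity only, not whether a given CPU or motherboard accepts those sticks, and it ignores the extra memory needed while loading:

```python
def weights_gb(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Size of the raw weights in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def fits_in_ram(n_params: float, sticks_gb: list[int], bytes_per_param: float = 2.0) -> bool:
    """True if the raw weights are no larger than the total installed RAM."""
    return weights_gb(n_params, bytes_per_param) <= sum(sticks_gb)


# 2 x 48 GB plus 2 x 32 GB, as described in the comment above.
sticks = [48, 48, 32, 32]
print("installed:", sum(sticks), "GB")
print("65B weights (assumed fp16):", weights_gb(65e9), "GB")
print("fits:", fits_in_ram(65e9, sticks))
```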


Tepix 1203 days ago

No current Ryzen CPUs work with 48GB DDR5, right? That means if you want to go beyond 128GB you can get an old X399 board (there are some reports of people getting 256GB to work) or a more recent Threadripper board.


downvotetruth 1202 days ago

Current Ryzen CPUs do not work with either 24GB or 48GB DDR5.


downvotetruth 1203 days ago

Follow up: https://github.com/facebookresearch/llama/issues/79#issuecom... claims 65B was able to fit in 128 GB by unsharding and merging the weights into a single file instead of the multiple .pth files, with 172 GB max swap file usage, and appears to stream to GPU.


haolez 1204 days ago

Why? Is it a limitation of the model or just something with the configuration that you couldn't figure out for this test?


downvotetruth 1204 days ago

I tried mark's OMP_NUM_THREADS suggestion (https://news.ycombinator.com/item?id=35018559) and did not see any obvious change to make it parallel. The huggingface patch (https://github.com/huggingface/transformers/pull/21955), once it gets in, is supposed to allow streaming from RAM to the GPU. So, for me it was not worth the effort to keep working on the CPU version, as even the best case ~30X speedup would still take around a minute to run the 7B.


basch 1204 days ago

I wonder if we will start to see complex prune functions and tools start to pop up.

So before you start a task, you sort of describe the domain, and the model is separated into the third most useful and relevant to that topic/query and the two thirds most distant from that realm. Then either just the relevant third is used in a detached fashion, or it works as 2 layers of cache, one in RAM and one on disk.
