| HN Mirror



	by Ilasky 1053 days ago
	What kind of hardware do I need to run this sufficiently well? I.e. say I want 10 tokens/s, what specs am I looking at?

2 comments

brianjking 1053 days ago

This particular model has 83.66 GB of model weights, so you'll need 2x Nvidia 80 GB A100s at a minimum unless you're loading it in 8-bit mode.
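
A rough sketch of the arithmetic behind that, assuming the 83.66 GB figure is a 16-bit checkpoint in decimal gigabytes (the thread doesn't say which). The helper names are made up for illustration, and this counts weights only, not activations or KV cache, so treat the results as a lower bound:

```python
import math

# Assumption: 83.66 GB is the size of the weights stored at 16 bits each.
FP16_WEIGHTS_GB = 83.66
N_PARAMS = FP16_WEIGHTS_GB * 1e9 / 2  # 2 bytes per weight at 16-bit


def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in decimal GB, at a given precision."""
    return n_params * bits_per_weight / 8 / 1e9


def gpus_needed(total_gb: float, gpu_gb: float = 80.0) -> int:
    """Minimum number of GPUs whose combined memory holds total_gb."""
    return math.ceil(total_gb / gpu_gb)


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weights_gb(N_PARAMS, bits)
        print(f"{bits:>2}-bit: {size:6.2f} GB of weights, "
              f"at least {gpus_needed(size)} x 80 GB GPU(s)")
```

Under that assumption, 16-bit weights overflow a single 80 GB card, while 8-bit weights come to roughly half the size and fit on one.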


brianjking 1053 days ago

With that said, there are ggml/gptq and other optimization techniques.


brucethemoose2 1053 days ago

Pretty much anything with 32GB (?) total RAM+VRAM:
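
To put a number on what a 32 GB budget allows, here is a back-of-the-envelope sketch. It assumes roughly 41.83 billion parameters (what an 83.66 GB checkpoint would imply at 16 bits per weight), and `reserve_gb` is a hypothetical allowance for the OS, KV cache, and buffers, not a measured value:

```python
# Assumption: parameter count inferred from an 83.66 GB 16-bit checkpoint.
N_PARAMS = 41.83e9


def max_bits_per_weight(budget_gb: float, n_params: float,
                        reserve_gb: float = 0.0) -> float:
    """Largest average bits per weight that fits in the memory budget.

    budget_gb is total RAM + VRAM in decimal GB; reserve_gb is memory
    held back for everything that is not model weights.
    """
    usable_bytes = (budget_gb - reserve_gb) * 1e9
    return usable_bytes * 8 / n_params


if __name__ == "__main__":
    print(f"no reserve:   {max_bits_per_weight(32, N_PARAMS):.2f} bits/weight")
    print(f"4 GB reserve: {max_bits_per_weight(32, N_PARAMS, 4):.2f} bits/weight")
```

With those assumptions, the ceiling works out to a bit over 6 bits per weight with nothing reserved and a bit over 5 with 4 GB held back, which is why a quantized build is needed to fit in this range.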

https://github.com/cmp-nct/ggllm.cpp

But it's going to be slow without even a small Nvidia GPU (a 2060?). CPUs are really slow at prompt ingestion, and that can't be hidden with streaming.


brianjking 1053 days ago

Doesn't this new version of falcon need to be ggml'ed first?


version_five 1053 days ago

The architecture is the same, I believe; it's just a fine-tune, so there's nothing special to be done for this version. That said, ggml doesn't support Falcon, but I saw today there is a fork that claims to, though I didn't try it.


brucethemoose2 1053 days ago

That link above is the fork ^

It uses the ggml library, just like llama.cpp does, and is indeed a fork of llama.cpp's implementation of ggml.


version_five 1053 days ago

Right, I'm being stupid, that's the fork I saw earlier today; I didn't realize. Have you tried it? IIRC the documentation mentioned a 2-bit quantization of the 40B model performing well. I've been using a 5-bit 7B llama2, which I'm generally happy with (because it can run on a pretty crappy machine), but I'm interested to see the differences.


brucethemoose2 1053 days ago

I wouldn't go lower than Q3_K_S, as it's basically the same file size, and llama 33B has a big perplexity dropoff.
