Hacker News
by Ilasky 1053 days ago
What kind of hardware do I need to run this sufficiently well? I.e. say I want 10 tokens/s, what specs am I looking at?
2 comments

This particular model has 83.66GB of model weights, so you'll need 2x Nvidia 80GB A100s at a minimum unless you're loading it in 8-bit mode.
With that said, there are ggml/gptq and other optimization techniques.
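
For a rough feel for the arithmetic, here is a minimal sketch. It assumes the 83.66GB figure is a 16-bit checkpoint, and it counts weights only (no KV cache, activations, or quantization overhead), so real requirements will be somewhat higher. The function names are made up for illustration.

```python
import math

def weights_gb(fp16_gb: float, bits: int) -> float:
    """Scale a 16-bit checkpoint size to another bit width (weights only)."""
    return fp16_gb * bits / 16

def gpus_needed(size_gb: float, gpu_gb: float = 80.0) -> int:
    """Smallest number of GPUs whose combined memory holds the weights."""
    return math.ceil(size_gb / gpu_gb)

if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weights_gb(83.66, bits)  # 83.66 GB is assumed to be 16-bit
        print(f"{bits}-bit: {size:.2f} GB, {gpus_needed(size)} x 80GB GPU(s)")
```

Under those assumptions, 16-bit needs two 80GB cards and 8-bit fits on one, which matches the numbers above.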
Pretty much anything with 32GB (?) total RAM+VRAM:

https://github.com/cmp-nct/ggllm.cpp

But it's going to be slow without even a small Nvidia GPU (a 2060?). CPUs are really slow at prompt ingestion, and that can't be hidden with streaming.
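
To see why streaming doesn't hide it: nothing can be streamed until the whole prompt has been processed, so the wait before the first token scales with prompt length. A small sketch, where the function names and the token rates are purely illustrative and not measurements of any hardware:

```python
def time_to_first_token(prompt_tokens: int, prompt_tok_per_s: float) -> float:
    """Seconds spent ingesting the prompt before any output can stream."""
    return prompt_tokens / prompt_tok_per_s

def total_time(prompt_tokens: int, new_tokens: int,
               prompt_tok_per_s: float, gen_tok_per_s: float) -> float:
    """Prompt ingestion time plus generation time, in seconds."""
    return (time_to_first_token(prompt_tokens, prompt_tok_per_s)
            + new_tokens / gen_tok_per_s)

if __name__ == "__main__":
    # Hypothetical rates, chosen only to show the shape of the problem.
    wait = time_to_first_token(prompt_tokens=1000, prompt_tok_per_s=10.0)
    print(f"wait before first token: {wait:.0f} s")
```

Generation can feel fine at a few tokens per second because you read along as it streams, but the ingestion wait is dead time.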

Doesn't this new version of falcon need to be ggml'ed first?
The architecture is the same, I believe; it's just a fine-tune, so there's nothing special to be done for this version. That said, ggml doesn't support Falcon, but I saw today there is a fork that claims to, though I didn't try it.
That link above is the fork ^

It uses the ggml library, just like llama.cpp does, and is indeed a fork of llama.cpp's implementation of ggml.

Right, I'm being stupid, that's the fork I saw earlier today; I didn't realize. Have you tried it? IIRC the documentation mentioned a 2-bit quantization of the 40B model performing well. I've been using a 5-bit 7B llama2, which I'm generally happy with (because it can run on a pretty crappy machine), but I'm interested to see the differences.
I wouldn't go lower than Q3_K_S, as it's basically the same file size as 2-bit, and llama 33B has a big perplexity dropoff below that.