Hacker News
by brucethemoose2 1052 days ago
Pretty much anything with 32GB (?) total RAM+VRAM:

https://github.com/cmp-nct/ggllm.cpp

But it's going to be slow without even a small Nvidia GPU (a 2060?). CPUs are really slow at prompt ingestion, and that can't be hidden with streaming.


Doesn't this new version of falcon need to be ggml'ed first?
The architecture is the same, I believe; it's just a fine-tune, so there's nothing special to be done for this version. That said, ggml doesn't support Falcon, but I saw today that there is a fork that claims to, though I didn't try it.
That link above is the fork ^

It uses the ggml library, just like llama.cpp does, and is indeed a fork of llama.cpp's implementation of ggml.

Right, I'm being stupid: that's the fork I saw earlier today, I just didn't realize. Have you tried it? IIRC the documentation mentioned 2-bit quantization of the 40B model performing well. I've been using a 5-bit 7B llama2, which I'm generally happy with (because it can run on a pretty crappy machine), but I'm interested to see the differences.
I wouldn't go lower than Q3_K_S: it's basically the same file size as the 2-bit quant, and llama 33B has a big perplexity dropoff below that.
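
For a rough sense of where the ~32GB RAM+VRAM figure and the file size comparisons come from, here is a back-of-the-envelope sketch. It only multiplies parameter count by bits per weight; the function name and the bits-per-weight values are illustrative assumptions, not numbers measured from ggllm.cpp or any specific quant format, and real files also carry overhead (scales, unquantized tensors, KV cache at runtime).

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores KV cache, activations, and per-format overhead, so treat
    the result as a lower bound, not an exact file size.
    """
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical effective bits-per-weight values, chosen only to
    # show how the estimate scales; actual quant formats differ.
    for bits in (2.5, 3.5, 4.5, 5.5):
        size = quantized_size_gb(40e9, bits)
        print(f"40B model @ {bits} bits/weight ~ {size:.1f} GB")
```

By this estimate a 40B model lands well under 32GB at 4 to 5 bits per weight, which leaves some headroom for context and runtime buffers.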