by aortega 1192 days ago
That is still a 4000 USD computer. You can get 2 used RTX 3090s for ~1000 USD and run 65B much faster.

I have a discord server up serving almost 500 users with 65B.

https://twitter.com/ortegaalfredo/status/1635402627327590400

For some things it is better than GPT-3; for others even Alpaca is better.

2 comments

How do you make it load on two GPUs, or does llama.cpp do it automatically? I have a setup with a Threadripper, an RTX 3090 and a Titan RTX. I haven't had the time to set it up, so that's why I have been using my Mac.
llama.cpp doesn't use the GPU at all. The genius *.cpp projects (whisper.cpp, llama.cpp) are specifically intended to optimize/democratize models that are otherwise GPU-only (CUDA, ROCm) so that they run on CPU. Technically speaking, the released models are capable of running on CPU via standard framework (PyTorch, TensorFlow) support, but in practice, without a lot of optimization, they are incredibly slow to the point of being useless, hence *.cpp.

You want something along these lines (warning: unnecessarily potentially offensive):

https://rentry.org/llama-tard-v2

llama.cpp takes advantage of the fact that LLaMA 7B is a tiny, very optimized model. It would run on anything, and very fast. I really doubt you can run the 30B or 65B models at acceptable speed on a CPU, at least for a couple of years. (I'm ready to eat my words in a couple of weeks.)
Okay, my Threadripper can handle it because it has 128GB of RAM.
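For anyone wondering whether a given model fits in a given memory budget, here is a rough back-of-the-envelope sketch. It counts only the weights (parameters x bits per weight / 8) and ignores the KV cache, activations and runtime overhead, so real usage will be higher. The parameter counts are the nominal model names, not exact figures; the 48GB budget assumes two 24GB cards (e.g. an RTX 3090 plus a Titan RTX), and the function names are just illustrative.

```python
# Rough estimate of weight memory for a model at a given precision.
# Assumption: weights only; KV cache, activations and overhead are ignored.

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params: float, bits_per_weight: int, budget_gb: float) -> bool:
    """True if the weights alone fit in the given memory budget."""
    return weight_memory_gb(n_params, bits_per_weight) <= budget_gb


# Nominal sizes taken from the model names (assumption, not exact counts).
MODELS = {"7B": 7e9, "30B": 30e9, "65B": 65e9}

# Hypothetical budgets matching the setups mentioned in this thread.
BUDGETS_GB = {"2x 24GB GPUs": 48, "128GB RAM": 128}

if __name__ == "__main__":
    for name, n_params in MODELS.items():
        for bits in (16, 8, 4):
            size = weight_memory_gb(n_params, bits)
            verdict = ", ".join(
                f"{label}: {'fits' if size <= gb else 'too big'}"
                for label, gb in BUDGETS_GB.items()
            )
            print(f"{name} @ {bits}-bit ~ {size:.1f} GB ({verdict})")
```

By this estimate 65B at 4-bit needs about 32.5GB for the weights, which is under both 48GB of combined VRAM and 128GB of system RAM, while 16-bit (about 130GB) fits in neither. Whether the speed is acceptable is a separate question.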
Thanks for taking the time to set this up. I will definitely give it a go later today. I don't have access to hardware that I can run LLaMA on and I'm really curious to see what the 65B model has to offer.