by smiley1437 1105 days ago
>> run LLaMA 30B at almost 3 tokens/s

Please tell me your config! I have an i9-10900 with 32GB of ram that only gets .7 tokens/s on a 30B model

With a single NVIDIA 3090 and the fastest inference branch of GPTQ-for-LLAMA https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/fastest-i..., I get a healthy 10-15 tokens per second on the 30B models. IMO GGML is great (And I totally use it) but it's still not as fast as running the models on GPU for now.
> IMO GGML is great (And I totally use it) but it's still not as fast as running the models on GPU for now.

I think it was originally designed to be easily embeddable—and most importantly, native code (i.e. not Python)—rather than competitive with GPUs.

I think it's just starting to get into GPU support now, but carefully.

Have you tried the most recent cuda offload? A dev claims they are getting 26.2ms/token (38 tokens per second) on 13B with a 4080.
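For anyone comparing the numbers in this thread, latency (ms/token) and throughput (tokens/s) are just reciprocals of each other. A minimal sketch of the conversion follows; the function names are made up for illustration and do not come from llama.cpp or koboldcpp.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to throughput in tokens/s."""
    if ms_per_token <= 0:
        raise ValueError("ms_per_token must be positive")
    return 1000.0 / ms_per_token


def ms_per_token(tokens_per_s: float) -> float:
    """Convert throughput in tokens/s to a per-token latency in milliseconds."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return 1000.0 / tokens_per_s


if __name__ == "__main__":
    # The 4080 figure quoted above: 26.2 ms/token
    print(f"{tokens_per_second(26.2):.1f} tokens/s")  # 38.2 tokens/s
    # The i9-10900 figure from the original question: 0.7 tokens/s
    print(f"{ms_per_token(0.7):.0f} ms/token")  # 1429 ms/token
```

So the 0.7 tokens/s CPU result works out to roughly 1.4 seconds per token, against about 26 ms per token for the quoted GPU run (on a smaller 13B model, so the two are not directly comparable).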
> Please tell me your config! I have an i9-10900 with 32GB of ram that only gets .7 tokens/s on a 30B model

Have you quantized it?

The model I have is q4_0 I think that's 4 bit quantized

I'm running in Windows using koboldcpp, maybe it's faster in Linux?

I am running Linux with cublast offload, and I am using the new 3-bit quant that was just pulled in a day or two ago.
Thanks! I'll have to try the 3-bit to see if that helps.
cuBLAS or CLBlast? There is no such thing as cublast.
> The model I have is q4_0 I think that's 4 bit quantized

That's correct, yeah. Q4_0 should be the smallest and fastest quantized model.
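As a rough sanity check on whether a quantized model fits in RAM, here is a back-of-the-envelope sketch. It assumes q4_0 costs about 4.5 bits per weight (blocks of 32 4-bit weights plus one 16-bit scale per block) and takes the nominal 30B parameter count at face value. Both numbers are assumptions, and real files carry some extra overhead.

```python
# Assumption: each q4_0 block holds 32 weights at 4 bits plus one 16-bit scale,
# i.e. (32 * 4 + 16) / 32 = 4.5 bits per weight on average.
BITS_PER_WEIGHT_Q4_0 = (32 * 4 + 16) / 32


def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Estimate the weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    nominal_params = 30e9  # assumption: nominal "30B" parameter count
    size = estimate_size_gb(nominal_params, BITS_PER_WEIGHT_Q4_0)
    print(f"q4_0 estimate: {size:.1f} GB")  # q4_0 estimate: 16.9 GB
```

Under those assumptions the weights alone come to roughly 17 GB, so they should fit in 32GB of RAM with room left for the context and the OS.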

> I'm running in Windows using koboldcpp, maybe it's faster in Linux?

Possibly. You could try using WSL to test—I think both WSL1 and WSL2 are faster than Windows (but WSL1 should be faster than WSL2).

I didn't know what WSL was, but now I do, thanks for the tip!
I'm on a Ryzen 4900HS laptop with an RTX 2060.

Like I said, very modest.

Are you offloading layers to the RTX 2060?
Some of them, yeah. 17 layers iirc.