| HN Mirror



	by enlyth 1184 days ago
	Exactly, we're just below that sweet spot right now. For example, on 24GB, Llama 30B runs only in 4-bit mode and very slowly, but I can imagine an RLHF-finetuned 30B or 65B version running in at least 8-bit being actually useful, and you could easily run it on your own computer.

2 comments

bick_nyers 1184 days ago

Do you know where the cutoff is? Does 32GB VRAM give us 30B int8 with/without an RLHF layer? I don't think the 5090 is going to go straight to 48GB; I'm thinking either 32 or 40GB (if not 24GB).
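
A rough way to reason about the cutoff is to count bytes for the weights alone: parameters times bits per weight, divided by 8. The sketch below does that arithmetic for the nominal 30B and 65B sizes mentioned in this thread. The function names and the 2 GiB `overhead_gib` allowance for activations and cache are assumptions made for illustration, not measured values, and real usage depends on context length and the runtime. RLHF fine-tuning changes the weight values rather than the parameter count, so it should not move this estimate.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_gib(params_billions: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


def fits(params_billions: float, bits_per_weight: int, vram_gib: float,
         overhead_gib: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in vram_gib.

    overhead_gib is a placeholder guess, not a measurement.
    """
    return weight_gib(params_billions, bits_per_weight) + overhead_gib <= vram_gib


if __name__ == "__main__":
    # Nominal sizes as quoted in the thread; real checkpoints may differ slightly.
    for name, params in [("30B", 30.0), ("65B", 65.0)]:
        for bits in (4, 8, 16):
            size = weight_gib(params, bits)
            print(f"{name} at {bits}-bit: {size:.1f} GiB of weights, "
                  f"fits in 24 GiB: {fits(params, bits, 24)}, "
                  f"fits in 32 GiB: {fits(params, bits, 32)}")
```

Under these assumptions a nominal 30B model at 8-bit sits just under 28 GiB of weights, so whether it fits in 32GB comes down to how much the runtime needs on top of that.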


riku_iki 1184 days ago

> For example on 24GB, Llama 30B runs only in 4bit mode and very slowly

Why do you think adding VRAM, but not cores, will make it run faster?


enlyth 1184 days ago

I've been told the 4-bit quantization slows it down, but don't quote me on this since I was unable to benchmark at 8-bit locally.

In any case, you're right that it might not be as significant. However, the quality of the output increases with 8/16-bit, and running 65B is completely impossible on 24GB.


riku_iki 1184 days ago

It's not impossible: there are several projects that load the model layer by layer from disk or RAM for execution, but it will be much slower.
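
The layer-by-layer idea can be sketched with a toy model: each layer's weights live in their own file, and the forward pass loads one layer, applies it, and drops it before touching the next. This is only an illustration of the general technique, not how any particular project implements it. The file layout, the function names (`save_layers`, `run_streaming`), and the plain matrix-vector "layers" are all made up for the example.

```python
import json
import os
import tempfile


def save_layers(layers, directory):
    """Write each layer's weight matrix to its own file; return the paths."""
    paths = []
    for i, weights in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.json")
        with open(path, "w") as f:
            json.dump(weights, f)
        paths.append(path)
    return paths


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def run_streaming(paths, x):
    """Run the toy model while holding only one layer in memory at a time."""
    for path in paths:
        with open(path) as f:
            weights = json.load(f)  # load just this layer from disk
        x = matvec(weights, x)
        del weights  # release the layer before loading the next one
    return x


with tempfile.TemporaryDirectory() as d:
    toy_layers = [
        [[1.0, 0.0], [0.0, 2.0]],  # scales the second component
        [[0.0, 1.0], [1.0, 0.0]],  # swaps the two components
    ]
    layer_paths = save_layers(toy_layers, d)
    result = run_streaming(layer_paths, [3.0, 4.0])
    print(result)
```

Peak memory here is one layer plus the activations, which is why the approach can run a model larger than available memory; the cost is that every layer is re-read from storage on each pass, which is where the slowdown comes from.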
