| HN Mirror



	by riku_iki 1228 days ago
	> For example on 24GB, Llama 30B runs only in 4bit mode and very slowly

	Why do you think adding VRAM, but not cores, will make it run faster?

1 comments

enlyth 1228 days ago

I've been told that 4-bit quantization slows it down, but don't quote me on this, since I was unable to benchmark at 8-bit locally.

In any case, you're right that it might not be as significant. However, the quality of the output increases with 8/16-bit, and running 65B is completely impossible on 24GB.
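
A rough back-of-the-envelope check of the memory claims above, as a sketch only. It assumes the nominal parameter counts (30B and 65B), treats "24GB" as 24 GiB, and counts the weights alone; activations, the KV cache, and framework overhead are ignored, so real requirements are higher. The function name `weight_memory_gib` is made up for illustration.

```python
def weight_memory_gib(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    n_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return n_bytes / 2**30


VRAM_GIB = 24  # assumption: the "24GB" card is treated as 24 GiB

for size in (30, 65):          # nominal model sizes, in billions of parameters
    for bits in (4, 8, 16):    # quantization levels mentioned in the thread
        need = weight_memory_gib(size, bits)
        verdict = "fits" if need <= VRAM_GIB else "does not fit"
        print(f"{size}B @ {bits}-bit: {need:6.1f} GiB -> {verdict}")
```

Under these assumptions only the 30B model at 4-bit fits in 24 GiB, which is consistent with what is described in this thread.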


riku_iki 1228 days ago

It's not impossible: there are several projects which load the model layer by layer from disk or RAM for execution, but it will be much slower.
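
A toy sketch of the layer-by-layer idea, not the code of any particular project. It assumes each "layer" is just a small weight matrix stored in its own file; all names here (`save_layers`, `run_offloaded`, `demo`) are hypothetical. Only one layer is held in memory at a time, and the repeated disk reads are where the slowdown comes from.

```python
import pickle
import tempfile
from pathlib import Path


def save_layers(layers, directory):
    """Write each layer's weights to its own file and return the paths."""
    paths = []
    for i, weights in enumerate(layers):
        path = Path(directory) / f"layer_{i}.pkl"
        with open(path, "wb") as f:
            pickle.dump(weights, f)
        paths.append(path)
    return paths


def matvec(matrix, vector):
    """Plain matrix-vector product standing in for a real layer."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def run_offloaded(paths, x):
    """Run the layers in order, loading one from disk at a time."""
    for path in paths:
        with open(path, "rb") as f:
            weights = pickle.load(f)  # only this layer is resident now
        x = matvec(weights, x)
        del weights  # drop the layer before loading the next one
    return x


def demo():
    # Two tiny 2x2 "layers": a scaling followed by a swap.
    layers = [
        [[1.0, 0.0], [0.0, 2.0]],
        [[0.0, 1.0], [1.0, 0.0]],
    ]
    with tempfile.TemporaryDirectory() as d:
        paths = save_layers(layers, d)
        return run_offloaded(paths, [3.0, 4.0])


if __name__ == "__main__":
    print(demo())
```

Peak memory here is bounded by the largest single layer rather than the whole model, at the cost of re-reading every layer from storage on each forward pass.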
