by riku_iki 1181 days ago
> For example on 24GB, Llama 30B runs only in 4bit mode and very slowly

Why do you think adding VRAM, but not cores, will make it run faster?

I've been told that 4-bit quantization slows it down, but don't quote me on this, since I was unable to benchmark at 8-bit locally.

In any case, you're right that it might not be as significant. However, output quality improves at 8/16-bit, and running 65B is completely impossible on 24GB.
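
For a rough sense of the numbers, here is a back-of-the-envelope sketch of weight memory alone. It assumes the nominal parameter counts (30B, 65B), treats the card as a 24 GiB budget, and ignores activations, the KV cache, and quantization overhead, so real usage is higher. The function name is made up for illustration.

```python
def weight_memory_gib(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights only, in GiB."""
    n_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return n_bytes / 2**30

VRAM_GIB = 24  # assumed budget for a "24GB" card

for size in (30, 65):
    for bits in (4, 8, 16):
        gib = weight_memory_gib(size, bits)
        verdict = "fits" if gib <= VRAM_GIB else "does not fit"
        print(f"{size}B @ {bits}-bit: ~{gib:.1f} GiB ({verdict})")
```

Under these assumptions only 30B at 4-bit comes in under the budget; 65B exceeds it even at 4-bit.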

It's not impossible: there are several projects that load the model layer by layer from disk or RAM for execution, but it will be much slower.
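
A toy sketch of the layer-by-layer idea, not taken from any of those projects: each "layer" here is just a small matrix stored in its own file, and only one is held in memory at a time. All names are hypothetical, and a real implementation would stream tensors to the GPU rather than unpickle Python lists.

```python
import os
import pickle
import tempfile

def save_layers(layers, directory):
    """Write each layer's weights to its own file and return the paths."""
    paths = []
    for i, weights in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(weights, f)
        paths.append(path)
    return paths

def apply_layer(weights, x):
    """Toy layer: a plain matrix-vector product (no attention, no nonlinearity)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def run_streamed(paths, x):
    """Run all layers in order, keeping only one layer's weights loaded at a time."""
    for path in paths:
        with open(path, "rb") as f:
            weights = pickle.load(f)  # read this layer from disk
        x = apply_layer(weights, x)
        del weights  # drop the reference before loading the next layer
    return x

layers = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
with tempfile.TemporaryDirectory() as tmp:
    paths = save_layers(layers, tmp)
    result = run_streamed(paths, [3.0, 4.0])
print(result)
```

The slowdown comes from the loop itself: every forward pass re-reads every layer, so throughput is bounded by disk or RAM transfer speed rather than by compute.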