| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by treprinum 605 days ago
	4090 is 5x faster than M3 Max 128GB according to my tests but it can't even inference LLaMA-30B. The moment you hit that memory limit the inference is suddenly 30x slower than M3 Max. So a basic GPU with 128GB RAM would trash 4090 on those larger LLMs.

2 comments

skirmish 605 days ago

Quantized 30B models should run in 24GB VRAM. A quick search found people doing that with good speed: [1]

    I have a 4090, PCIe 3x16, DDR4 RAM.
    
    oobabooga/text-generation-webui
    using exllama
    I can load 30B 4bit GPTQ models and use full 2048 context
    I get 30-40 tokens/s

[1] https://old.reddit.com/r/LocalLLaMA/comments/14gdsxe/optimal...

link

treprinum 605 days ago

Quantized sure but there is some loss of variability of the output one can notice quickly with 30B models. If you want to use the fp16 version you are out of luck.

link

m00x 605 days ago

Do you have the code for that test?

link

treprinum 605 days ago

I ran some variation of llama.cpp that could handle large models by running portion of them on GPU and if too large, the rest on CPU and those were the results. Maybe I can dig it from some computer at home but it was almost like a year ago when I got M3 Max with 128GB RAM.

link