HN Mirror



	by brutus1213 269 days ago
	I recently got a 5090 with 64 GB of RAM (Intel CPU). I'm looking for a strong model I can host locally. If I got GPT-4o-level performance, I'd be content. Any suggestions, or cases where people ended up disappointed?

2 comments

bogtog 269 days ago

GPT-OSS-20B at 4 or 8 bits is probably your best bet. Qwen3-30B-A3B is probably the next best option. Maybe there's some 1.7- or 2-bit version of GPT-OSS-120B.

p1esk 269 days ago

The 5090 has 32 GB of VRAM. Not sure that’s enough to fit this model.
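
A rough back-of-the-envelope check, as a sketch only: weight memory is approximately parameter count times bits per weight, divided by 8. The parameter counts below are read nominally off the model names (20B, 120B), which is an assumption, and the estimate ignores KV cache, activations, and runtime overhead, so real usage will be higher.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in decimal GB: (params * bits) / 8 bytes."""
    return params_billion * bits_per_weight / 8


VRAM_GB = 32  # RTX 5090

# (name, nominal parameters in billions, bits per weight).
# Sizes are taken from the model names, not exact parameter counts (assumption).
candidates = [
    ("GPT-OSS-20B", 20, 4),
    ("GPT-OSS-20B", 20, 8),
    ("GPT-OSS-120B", 120, 2),
    ("GPT-OSS-120B", 120, 4),
]

for name, params_b, bits in candidates:
    gb = weight_gb(params_b, bits)
    verdict = "fits" if gb < VRAM_GB else "does not fit"
    print(f"{name} @ {bits}-bit: ~{gb:.1f} GB of weights, {verdict} in {VRAM_GB} GB")
```

Under these assumptions a 2-bit 120B model is about 30 GB of weights, which leaves almost no room in 32 GB for context, while a 4-bit one is about 60 GB and would need part of it to live in system RAM.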

IceWreck 269 days ago

llama.cpp supports offloading some of the experts in a MoE model to the CPU. The results are very good, and even weaker GPUs can run larger models at reasonable speeds.

n-cpu-moe in https://github.com/ggml-org/llama.cpp/blob/master/tools/serv...
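
To get a feel for how many layers' experts would have to stay on the CPU, here is a minimal sketch. It assumes the expert weights are spread evenly across layers, and every number in the example (model size, layer count, expert share, VRAM budget) is an illustrative guess, not a measured value. The function name is hypothetical and not part of llama.cpp.

```python
import math


def min_layers_on_cpu(model_gb: float, n_layers: int,
                      expert_share: float, vram_budget_gb: float) -> int:
    """Smallest number of layers whose experts must sit in system RAM
    so that the remaining weights fit in the VRAM budget.

    Assumes expert weights are split evenly across layers (simplification).
    """
    excess_gb = model_gb - vram_budget_gb
    if excess_gb <= 0:
        return 0  # everything fits on the GPU
    per_layer_expert_gb = model_gb * expert_share / n_layers
    needed = math.ceil(excess_gb / per_layer_expert_gb)
    if needed > n_layers:
        raise ValueError("does not fit even with all experts on the CPU")
    return needed


# Illustrative guesses: 60 GB of weights, 36 layers, 90% of weights in experts,
# 28 GB of the 32 GB VRAM reserved for weights (rest left for context).
print(min_layers_on_cpu(model_gb=60, n_layers=36,
                        expert_share=0.9, vram_budget_gb=28))
```

The result is the kind of number one might try as the value for the n-cpu-moe option mentioned above. Treat it as a starting point and adjust while watching actual VRAM use.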

svnt 269 days ago

It should fit enough of the layers to make it reasonably performant.
