by modeless 321 days ago
What's the best speed people have gotten on 4090s?

I'm on a 5090, so it's not an apples-to-apples comparison. But I'm getting ~150 t/s for the 20B version with a context size of ~16,000.
And flash attention doesn't work on 5090 yet, right? So the 4090 is probably faster for now?
I don't think the 4090 has native 4-bit support, which will probably have a significant impact.
> And flash attention doesn't work on 5090 yet, right?

Flash attention works with GPT-OSS + llama.cpp (tested on 1d72c8418) on another Blackwell card (RTX Pro 6000), so I think it should work on the 5090 as well; it's the same architecture, after all.

Cool, what software?
Initial testing has only been done with ollama. I plan on testing llama.cpp and vLLM when there's enough time.
You can't fit the model into a 4090 without quantization; it's like 64 gigs.

For home use, Gemma 27B QAT is king. It's almost as good as DeepSeek R1.

You don't really need it all to fit in VRAM, thanks to the efficient MoE architecture and llama.cpp's ability to keep part of the model in system RAM.

The 120B is running at 20 tokens/sec on my 5060 Ti 16GB with 64GB of system RAM. Personally I find 20 tokens/sec quite usable, but for some it may not be enough.
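
A rough way to sanity-check a setup like this is to compare the model size against VRAM plus system RAM. The sketch below is only back-of-the-envelope arithmetic: the ~64 GB figure for the 120B weights is taken from the comment above, `split_weights` is a hypothetical helper, and it ignores the KV cache, the OS, and any runtime overhead.

```python
def split_weights(model_gb, vram_gb, ram_gb):
    """Return (gb_on_gpu, gb_in_ram, fits) for a naive VRAM-first split.

    Assumption: weights can be split at arbitrary granularity and nothing
    else (KV cache, OS, runtime buffers) competes for memory.
    """
    on_gpu = min(model_gb, vram_gb)   # fill the GPU first
    in_ram = model_gb - on_gpu        # the remainder spills to system RAM
    return on_gpu, in_ram, in_ram <= ram_gb


# 120B model (~64 GB per the thread) on a 16 GB card with 64 GB of RAM
print(split_weights(64, 16, 64))  # (16, 48, True)
# Same card with 32 GB of RAM
print(split_weights(64, 16, 32))  # (16, 48, False)
```

Under these assumptions, 16 GB of VRAM plus 32 GB of RAM falls short of ~64 GB of weights, while the 64 GB RAM setup has room to spare.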

I have a similar setup but with 32 GB of RAM. Do you partly offload the model to RAM? Do you use LM Studio or something else to achieve this? Thanks!
Yes, LM Studio, and it does this automatically.
The 20B one fits.
Does it fit on a 5080 (16 GB)?
Haven't tried it myself, but it looks like it probably does. The weight files total 13.8 GB, which leaves a little room to hold your context.
It fits on a 5070 Ti, so it should fit on a 5080 as well.
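
For anyone who wants to put a number on "a little left over", here is a minimal sketch of the headroom arithmetic. The 13.8 GB figure is from the comment above; treating the full 16 GB as usable is an assumption (the driver and display typically take some), and `vram_headroom_gb` is a hypothetical helper name.

```python
def vram_headroom_gb(vram_gb, weights_gb):
    """VRAM left for context after loading the weights, in GB.

    Assumption: the whole card is available to the model.
    """
    return round(vram_gb - weights_gb, 2)


print(vram_headroom_gb(16, 13.8))  # 2.2
```

So roughly 2 GB remains for the KV cache and runtime buffers on a 16 GB card, before any other overhead is counted.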