HN Mirror


	by modeless 321 days ago
	What's the best speed people have gotten on 4090s?

2 comments

asabla 321 days ago

I'm on a 5090, so it's not an apples-to-apples comparison. But I'm getting ~150 t/s for the 20B version with a context size of ~16,000.

steinvakt2 321 days ago

And flash attention doesn't work on 5090 yet, right? So is the 4090 probably faster for now?

PeterStuer 321 days ago

I don't think the 4090 has native 4-bit support, which will probably have a significant impact.

diggan 321 days ago

> And flash attention doesn't work on 5090 yet, right?

Flash attention works with GPT-OSS + llama.cpp (tested on 1d72c8418) on another Blackwell card (RTX Pro 6000), so I think it should work on the 5090 as well; it's the same architecture, after all.

modeless 321 days ago

Cool, what software?

asabla 321 days ago

Initial testing has only been done with Ollama. I plan on testing llama.cpp and vLLM when there is enough time.

ActorNightly 321 days ago

You can't fit the model into a 4090 without quantization; it's like 64 gigs.

For home use, Gemma 27B QAT is king. It's almost as good as DeepSeek R1.
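
For anyone who wants to sanity-check size figures like the one above, here is a rough back-of-the-envelope sketch. It assumes round parameter counts taken from the model names in this thread (20e9 and 120e9, which are placeholders, not official numbers) and counts weights only; the KV cache, activations, and per-block quantization scales are ignored, so real files will differ.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical round parameter counts based on the model names only.
    for name, n_params in [("20B", 20e9), ("120B", 120e9)]:
        for bits in (16, 8, 4):
            size = weight_size_gb(n_params, bits)
            print(f"{name} at {bits}-bit: ~{size:.0f} GB")
```

Under these assumptions, 120e9 parameters at 4 bits per weight come to about 60 GB, which is in the same ballpark as the "64 gigs" mentioned, and 20e9 parameters at 4 bits come to about 10 GB.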

SirMaster 321 days ago

You don't really need it all to fit in VRAM, thanks to the efficient MoE architecture and llama.cpp.

The 120B is running at 20 tokens/sec on my 5060 Ti 16GB with 64GB of system RAM. Personally I find 20 tokens/sec quite usable, but for some it may not be enough.
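
To put numbers on a split setup like this, here is a minimal capacity sketch. It is not how llama.cpp or LM Studio actually decide what goes where; it only checks whether a model of a given size could be divided between VRAM and system RAM. The 60 GB model size and the 2 GB VRAM reserve (`vram_reserve_gb`) are assumptions made for illustration.

```python
def split_weights(model_gb: float, vram_gb: float, ram_gb: float,
                  vram_reserve_gb: float = 2.0) -> tuple[float, float]:
    """Return (gb_on_gpu, gb_in_ram) for a naive capacity-only split.

    vram_reserve_gb is a hypothetical allowance for the KV cache and
    runtime overhead; tune it for your own setup.
    """
    usable_vram = max(vram_gb - vram_reserve_gb, 0.0)
    on_gpu = min(model_gb, usable_vram)
    in_ram = model_gb - on_gpu
    if in_ram > ram_gb:
        raise ValueError("model does not fit in VRAM + system RAM combined")
    return on_gpu, in_ram


def generation_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate n_tokens at a steady decode speed."""
    return n_tokens / tokens_per_sec


if __name__ == "__main__":
    # Assumed ~60 GB of weights on a 16 GB card with 64 GB of system RAM.
    on_gpu, in_ram = split_weights(model_gb=60, vram_gb=16, ram_gb=64)
    print(f"GPU: {on_gpu:.0f} GB, system RAM: {in_ram:.0f} GB")
    # What 20 tokens/sec means for a 500-token reply.
    print(f"500 tokens at 20 t/s: {generation_seconds(500, 20):.0f} s")
```

With those assumed numbers, about 14 GB would sit on the GPU and 46 GB in system RAM, and a 500-token reply at 20 tokens/sec takes about 25 seconds. The same model would not pass this check with only 32 GB of system RAM.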

dexterlagan 320 days ago

I have a similar setup but with 32 GB of RAM. Do you partly offload the model to RAM? Do you use LM Studio or something else to achieve this? Thanks!

SirMaster 310 days ago

Yes, LM Studio, and it does this automatically.

modeless 321 days ago

The 20B one fits.

steinvakt2 321 days ago

Does it fit on a 5080 (16 GB)?