HN Mirror



	by asabla 314 days ago
	I'm on a 5090, so it's not an apples-to-apples comparison. But I'm getting ~150 t/s for the 20B version with a context size of ~16000.

2 comments

steinvakt2 314 days ago

And flash attention doesn't work on the 5090 yet, right? So for now the 4090 is probably faster, isn't it?


PeterStuer 314 days ago

I don't think the 4090 has native 4-bit support, which will probably have a significant impact.


diggan 314 days ago

> And flash attention doesn't work on 5090 yet, right?

Flash attention works with GPT-OSS + llama.cpp (tested on 1d72c8418) on another Blackwell card (RTX Pro 6000), so I think it should work on the 5090 as well. It's the same architecture, after all.


modeless 314 days ago

Cool, what software?


asabla 314 days ago

Initial testing has only been done with ollama. I plan to test llama.cpp and vLLM when I have enough time.
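
For anyone wanting to reproduce a t/s figure like the one above with ollama, here is a minimal sketch. It assumes the `eval_count` (generated tokens) and `eval_duration` (nanoseconds) fields that ollama returns in its generate response. The helper name and the sample numbers are hypothetical and are not taken from the thread's actual runs.

```python
def tokens_per_second(response: dict) -> float:
    """Generation throughput from an ollama generate-style response dict.

    Assumes `eval_count` is the number of generated tokens and
    `eval_duration` is the generation time in nanoseconds.
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    # Convert nanoseconds to seconds, then divide tokens by seconds.
    return eval_count / (eval_duration_ns / 1e9)


# Hypothetical sample: 300 tokens generated in 2 seconds.
sample = {"eval_count": 300, "eval_duration": 2_000_000_000}
print(f"{tokens_per_second(sample):.1f} t/s")
```

Note that this only covers generation speed. Prompt processing is reported separately, so a long context can still make a run feel slower than this number suggests.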
