by asabla 314 days ago
I'm on a 5090, so it's not an apples-to-apples comparison, but I'm getting ~150 t/s for the 20B version with a context size of ~16,000.
2 comments

And flash attention doesn't work on 5090 yet, right? So for now the 4090 is probably faster?
I don't think the 4090 has native 4-bit support, which will probably have a significant impact.
> And flash attention doesn't work on 5090 yet, right?

Flash attention works with GPT-OSS + llama.cpp (tested on 1d72c8418) on another Blackwell card (RTX Pro 6000), so I think it should work on the 5090 as well. It's the same architecture, after all.

Cool, what software?
Initial testing has only been done with Ollama. I plan on testing llama.cpp and vLLM when there is enough time.