by Lindon4290 722 days ago
Going by the results from the article/video, a single MI300X even outperforms a Groq system [1].

The video shows that the optimized Llama-2 70B run reaches 314 tokens/s at bs=1 with a 256-token prompt and 256 generated tokens. The Groq system is also a bs=1 setup and gets around 300 tokens/s. Wild!

[1] https://wow.groq.com/groq-sets-new-large-language-model-perf...

1 comments

Someone else on reddit [0] noticed this as well.

Groq does not talk about how many cards they need to get those results. Someone replied to me with this comment [1] a while ago...

[0] https://www.reddit.com/r/AMD_MI300/comments/1dqhrbn/comment/...

[1] https://news.ycombinator.com/item?id=39966620

Yeah, the rumour(?) is that a Groq system needs 576 chips (9 racks) to produce 300+ t/s on Llama-2 70B (bs=1) [1].

So, that's like $10M+ for serving bs=1 Llama-2 70B vs whatever a single MI300X costs?

[1] https://twitter.com/swyx/status/1759759125314146699
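To put the numbers quoted above side by side, here is a rough back-of-the-envelope sketch. It only uses figures from this thread, and the 576-chip count is a rumour rather than a confirmed spec, so treat the result as illustrative for bs=1 only. The function and variable names are made up for this example.

```python
# Figures quoted in this thread. The Groq chip count is a rumour,
# not a confirmed number, so the result is only a rough estimate.
GROQ_TOKENS_PER_S = 300     # "around 300 tokens/s" at bs=1
GROQ_CHIPS = 576            # rumoured: 576 chips across 9 racks
MI300X_TOKENS_PER_S = 314   # optimized Llama-2 70B run at bs=1
MI300X_CARDS = 1            # a single card


def per_device_throughput(tokens_per_s: float, devices: int) -> float:
    """Tokens per second attributable to each device at bs=1."""
    return tokens_per_s / devices


groq_per_chip = per_device_throughput(GROQ_TOKENS_PER_S, GROQ_CHIPS)
mi300x_per_card = per_device_throughput(MI300X_TOKENS_PER_S, MI300X_CARDS)
ratio = mi300x_per_card / groq_per_chip

print(f"Groq:   {groq_per_chip:.2f} tokens/s per chip")
print(f"MI300X: {mi300x_per_card:.2f} tokens/s per card")
print(f"Per-device ratio at bs=1: {ratio:.0f}x")
```

If the rumoured chip count is right, that works out to roughly half a token per second per Groq chip against 314 tokens/s on the one MI300X, but only for this single bs=1 workload.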

The exact cost of an MI300X is closely guarded by AMD. I buy them and still do not know the price of a single card. That said, a whole chassis of 8x is far, far, far less than $10M.

You should make a whole post about this! Like how a single MI300X outperforms Groq at bs=1.

300 tokens/s at bs=1 for Llama-2 70B on a single card is no joke.

This is why I sponsored running the chipsandcheese tests on my hardware. That prompted Elio to up the game even further.

All open source by the way.

Thank you for sponsoring this. There's so little buzz about this hardware despite the fact it's clearly amazing for AI use cases, and I don't understand why. Maybe this is why Nvidia is the most valuable company in the world: nobody can be bothered to try a competitor.