| HN Mirror



	by azeirah 829 days ago
	From how I understood it, it means they optimised the entire stack, from CUDA to the networking interconnects, specifically for data centers, meaning you get 30x more inference per dollar in a datacenter. This is probably not fluff, but it's only relevant for a very specific use case, i.e. enterprises with the money to buy a stack to serve thousands of users with LLMs. It doesn't matter for anyone who's not Microsoft, AWS, OpenAI or similar.

2 comments

misterdabb 829 days ago

It's a weird graph... The y-axis is specifically tokens per GPU, while the x-axis is "interactivity per second", so the y-axis includes Blackwell being twice the size and also the gain from fp8 -> fp4. Note that the fp4 gain needs to be counted multiple times, since half as much data has to go through the network as well.
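
To make the double-counting concrete, here is a rough back-of-the-envelope sketch. It assumes the factors named in this thread multiply cleanly, which real hardware almost certainly does not do; the function name and the factor values are illustrative, not vendor data.

```python
# Back-of-the-envelope split of a headline per-GPU speedup.
# Every factor below is an assumption taken from the thread, not a measurement.

def residual_speedup(headline, known_factors):
    """Divide a headline speedup by each known multiplicative factor."""
    residual = headline
    for factor in known_factors.values():
        residual /= factor
    return residual

known = {
    "die_size": 2.0,     # "Blackwell being twice the size"
    "fp8_to_fp4": 2.0,   # half the bits per value for compute
}
print(residual_speedup(30.0, known))  # 7.5

# Counting fp4 a second time for the halved network traffic (assumed 2x).
known_with_network = dict(known, fp4_network=2.0)
print(residual_speedup(30.0, known_with_network))  # 3.75
```

Under these assumptions, somewhere between roughly 4x and 7.5x of the headline 30x would be left to attribute to everything else in the stack.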


acchow 829 days ago

They showed 30x was for FP4. Who is using FP4 in practice?


KaoruAoiShiho 829 days ago

But maybe you should. Once the software stack is ready for it, there'll be more people using it, since the performance gains are so massive.


dagmx 828 days ago

It would depend highly on the model though. Some stuff will generalize better to FP4 than others.
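
A toy illustration of why that might be, assuming the common E2M1 layout for FP4 (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a single scale per list. The helper name `fake_quantize_fp4` and the per-list scaling are hypothetical simplifications; real stacks usually scale per small block of values.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quantize_fp4(values):
    """Round values onto a scaled FP4 grid and return the dequantized floats."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    scale = max_abs / FP4_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for v in values:
        # Pick the grid magnitude closest to the scaled value.
        nearest = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(nearest * scale if v >= 0 else -nearest * scale)
    return out

well_behaved = [6.0, 3.0, 1.0, -0.5]
with_outlier = [60.0, 1.0, 2.0, 3.0]
print(fake_quantize_fp4(well_behaved))  # [6.0, 3.0, 1.0, -0.5]
print(fake_quantize_fp4(with_outlier))  # [60.0, 0.0, 0.0, 5.0]
```

In this sketch a single outlier stretches the scale so far that the small values collapse to zero, which is one plausible reason tensors with heavy-tailed values would tolerate FP4 worse than tightly clustered ones.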
