HN Mirror


	by skryl 384 days ago
	Performance per watt is better than the H100 and B200, performance per watt per $ is worse than the B200, and it does FP8 just fine: https://arxiv.org/pdf/2503.11698

3 comments

skryl 383 days ago

One caveat is that this paper only covers training, which can be done on a single CS-3 using external memory (swapping weights in and out of SRAM). There is no way a single CS-3 will hit this record inference performance with external memory, so this was likely done with 10-20 CS-3 chips and the full model in SRAM. You definitely can’t compare tokens/$ for that kind of setup against a DGX.
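
A rough sketch of the arithmetic behind a chip-count estimate like this one. The 44 GB of on-chip SRAM per WSE-3 is taken from Cerebras' published specs; the 405B parameter count is a hypothetical placeholder, since the thread does not name the model, and `chips_needed` is an illustrative helper, not anything from the paper. It counts weights only and ignores KV cache and activations, so treat the result as a lower bound.

```python
import math

# Assumption: on-chip SRAM per WSE-3, per Cerebras' published specs.
WSE3_SRAM_GB = 44


def chips_needed(params_billions: float, bytes_per_param: float,
                 sram_gb: float = WSE3_SRAM_GB) -> int:
    """Minimum chip count to hold the weights alone in SRAM.

    Ignores KV cache, activations, and any replication, so the real
    number would be higher.
    """
    # 1e9 params * bytes per param / 1e9 bytes per GB = GB of weights
    weight_gb = params_billions * bytes_per_param
    return math.ceil(weight_gb / sram_gb)


# Hypothetical 405B-parameter model; the thread does not say which model was used.
for name, nbytes in [("FP16", 2), ("FP8", 1)]:
    print(name, chips_needed(405, nbytes))
# prints:
# FP16 19
# FP8 10
```

Under these assumptions, the weights alone need 19 chips at FP16 and 10 at FP8, which brackets the 10-20 range.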

ryao 384 days ago

Thanks for the correction. According to OpenRouter, they are currently using FP16 for inference. I had thought that implied they could not use FP8, given the pressure to use as little memory as possible when relying solely on SRAM. I wonder why they opted for FP16 instead of FP8.

lern_too_spel 384 days ago

Performance per watt per dollar is a useless metric as calculated. You can't spend more money on B200s to get more performance per watt.
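
One way to see the objection, as a minimal sketch with made-up placeholder numbers (not measurements of any real hardware): buying more identical units leaves performance per watt unchanged, so dividing it by total spend just makes the metric shrink with the size of the purchase. The function name and parameters are hypothetical.

```python
import math


def perf_per_watt_per_dollar(n_units: int, perf: float,
                             watts: float, price: float) -> float:
    """Metric for a cluster of n identical units, assuming perfect scaling."""
    total_perf = n_units * perf
    total_watts = n_units * watts
    total_price = n_units * price
    # Perf per watt is independent of n_units; only the divisor grows.
    return (total_perf / total_watts) / total_price


# Placeholder numbers chosen only to make the arithmetic easy to follow.
one = perf_per_watt_per_dollar(1, perf=1000.0, watts=500.0, price=100.0)
two = perf_per_watt_per_dollar(2, perf=1000.0, watts=500.0, price=100.0)

# Doubling the spend halves the metric even though efficiency is identical.
print(math.isclose(two, one / 2))  # prints: True
```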
