|
|
|
|
|
by YetAnotherNick
582 days ago
|
|
You get ~400TFLOP/s in H100 for 350W. You need (2 * token/s * param count) FLOP/s. For 405b, 969tok/s you just need 784 TFLOP/s which is just 2 H100s. The limiting factor with GPU for inference is memory bandwidth. For 969 tok/s in int8, you need 392 TB/s memory bandwidth or 200 H100s. |
|
Hence why you see AMD's MI325x coming out with 256GB HBM3e, but it is the same FLOPs as a 300x. 6TB/s too, which outperforms H200's, by a lot.
You can see the direction AMD is going with this...
https://www.amd.com/en/products/accelerators/instinct/mi300/...