by ml_hardware
1544 days ago
At inference time it will be possible to do 4000 TFLOPS using sparse FP8 :) But keep in mind the model won't fit on a single H100 (80GB): it's 175B params, so the weights are ~90GB even in sparse FP8, and you need more on top of that for live activation memory. So you'll still want at least 2 H100s to run inference, and more realistically you would rent an 8xH100 cloud instance. But yeah, the latency will be insanely fast given how massive these models are!
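For anyone who wants to check the memory arithmetic, here is a rough back-of-envelope sketch. It assumes 1 byte per FP8 weight and that sparsity keeps half of the weights, and it ignores sparsity index metadata and activation memory, so treat the result as a lower bound. The function names are made up for illustration.

```python
import math

def sparse_fp8_weight_gb(n_params, bytes_per_param=1.0, kept_fraction=0.5):
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Assumptions (not from any spec): FP8 is 1 byte per weight, sparsity
    keeps `kept_fraction` of the weights, and index metadata is ignored.
    """
    return n_params * bytes_per_param * kept_fraction / 1e9

def min_gpus(weight_gb, gpu_mem_gb=80.0):
    """Smallest GPU count whose combined memory holds the weights alone."""
    return math.ceil(weight_gb / gpu_mem_gb)

weights_gb = sparse_fp8_weight_gb(175e9)
print(f"weights: ~{weights_gb:.1f} GB, GPUs needed for weights alone: {min_gpus(weights_gb)}")
```

Under those assumptions the weights alone come to about 87.5 GB, which already exceeds one 80GB card before any activations are counted.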
|
Sounds doable in a generation or two.